Random Forest Classifier
Machine Learning - The basics of building a decent binary classifier on tabular data
Goal
The goal is to predict whether a passenger on the Titanic survived. Binary classification applies to countless real-world problems. Does this patient have this disease? Will this customer churn? Will the price go up? These are just a few examples.
The purpose of this post is to give a general guide to classification. If you get through this and want more detail, I highly recommend checking out the Tabular Chapter of Deep Learning for Coders with fastai & PyTorch by Jeremy Howard and Sylvain Gugger. The book primarily focuses on deep learning, though decision trees are covered for tabular data. All of the material in this guide and more is covered in much greater detail in that book.
https://www.amazon.com/Deep-Learning-Coders-fastai-PyTorch/dp/1492045527
from sklearn.ensemble import RandomForestClassifier
import seaborn as sns
import pandas as pd
import numpy as np
from fastai.tabular.all import *  # the pre-release package name was fastai2; it is now published as fastai
from fastai import *
from sklearn.model_selection import GridSearchCV
from dtreeviz.trees import *
from scipy.cluster import hierarchy as hc
df = sns.load_dataset('titanic')
df.head()
df.drop('survived',axis = 1, inplace=True)
dep_var = 'alive'
Training and Validation Set Split
Best practice is to have, at a minimum, a training set and a validation set. Those are the two that we will use for this tutorial.
- Training Set: This is what the model actually trains on
- Validation Set: This is used to gauge the success of the training
- Test Set: This is held out of the entire process as an additional safeguard against overfitting
cond = np.random.rand(len(df))>.2
train = np.where(cond)[0]
valid = np.where(~cond)[0]
splits = (list(train),list(valid))
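The split above produces only training and validation indices. If you also want the held-out test set described in the list, one option is to shuffle the row indices once and slice them three ways. The following is a minimal sketch in plain Python; the function name `three_way_split`, the 60/20/20 proportions, and the seed are assumptions for illustration and are not part of the original notebook.

```python
import random

def three_way_split(n_rows, valid_frac=0.2, test_frac=0.2, seed=42):
    """Return (train, valid, test) lists of row indices with no overlap."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)  # a fixed seed makes the split reproducible
    n_test = int(n_rows * test_frac)
    n_valid = int(n_rows * valid_frac)
    test = idx[:n_test]
    valid = idx[n_test:n_test + n_valid]
    train = idx[n_test + n_valid:]
    return train, valid, test

# 891 is the number of rows in the seaborn Titanic dataset
train_idx, valid_idx, test_idx = three_way_split(891)
print(len(train_idx), len(valid_idx), len(test_idx))
```

The first two lists could be passed as `splits` in place of the random mask, with the test indices kept aside until the very end.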
df['class'].unique()
classes = 'First','Second','Third'
df['class'] = df['class'].astype('category')
df['class'] = df['class'].cat.set_categories(classes, ordered=True)  # assign the result; the inplace argument was removed in pandas 2.0
procs = [Categorify, FillMissing]
cont,cat = cont_cat_split(df, 1, dep_var=dep_var)
to = TabularPandas(df, procs, cat, cont, y_names=dep_var, splits=splits)
Let's take a look at the training and validation sets and make sure we have a good split of each.
len(to.train),len(to.valid)
We can now see that while the data looks the same when displayed, behind the scenes it is all numeric. This is exactly what we need for our random forest.
to.show(3)
to.items.head(3)
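To make "behind the scenes it is all numeric" concrete, here is a rough sketch in plain Python of the kind of transformation the two `procs` apply: categories become integer codes, and missing numbers are filled in while a flag records where the gap was. This is a simplified illustration and not fastai's implementation; the helper names `categorify` and `fill_missing`, the use of the median as the fill value, and the sample values are all assumptions.

```python
from statistics import median

def categorify(values):
    """Map each distinct category to an integer code; 0 is reserved for missing."""
    vocab = sorted({v for v in values if v is not None})
    codes = {v: i + 1 for i, v in enumerate(vocab)}
    return [codes.get(v, 0) for v in values], vocab

def fill_missing(values):
    """Replace None with the median of the known values and flag the filled rows."""
    fill = median(v for v in values if v is not None)
    filled = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing

# Hypothetical sample rows, not taken from the dataset
embark_town = ['Southampton', 'Cherbourg', None, 'Queenstown', 'Southampton']
age = [22.0, 38.0, None, 35.0, None]

codes, vocab = categorify(embark_town)
filled_age, age_na = fill_missing(age)
print(codes)       # integer codes, one per row
print(filled_age)  # ages with the gaps filled
print(age_na)      # True where the age was originally missing
```

The flag column matters because "this value was missing" can itself be a useful signal for the model.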
xs,y = to.train.xs,to.train.y
valid_xs,valid_y = to.valid.xs,to.valid.y
m = RandomForestClassifier(n_estimators=100)
m = m.fit(xs,y)
from sklearn.metrics import confusion_matrix
confusion_matrix(y,m.predict(xs))
Looking pretty good! Only 6 wrong. Let's see how it did on the validation set.
confusion_matrix(valid_y,m.predict(valid_xs))
Still way better than 50/50, but not quite as good. This is expected: the model never saw the validation data during training, so it cannot fit these rows as closely as it fit the training rows.
Model Tuning - Grid Search
We made our first model, and it doesn't seem to predict as well as we would like. Let's do something about that.
We are going to do a grid search. There are many more sophisticated ways to find parameters (maybe a future post), but the grid search is easy to understand. Basically you pick some ranges, and you try them all to see what works best.
We will use scikit-learn's built-in GridSearchCV. All we need to do is define the range of parameters, and let it find the best model.
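parameters = {'n_estimators':range(10,20,20),    # note: range(10,20,20) yields only the value 10
              'max_depth':range(10,20,20),       # likewise a single value, 10; use a smaller step to search more
              'min_samples_split':range(2,20,1),
              'max_features':['sqrt','log2']}    # 'sqrt' is what 'auto' meant for classifiers; 'auto' was removed in scikit-learn 1.3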
clf = GridSearchCV(RandomForestClassifier(), parameters, n_jobs=-1)
clf.fit(xs,y)
confusion_matrix(y,clf.best_estimator_.predict(xs))
confusion_matrix(valid_y,clf.best_estimator_.predict(valid_xs))
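It can help to see exactly which combinations a grid contains before spending time fitting them. The sketch below expands a grid shaped like the one above using only the standard library; the helper name `expand_grid` is an assumption, and the `max_features` entries only matter here for counting. Note that a `range` whose step is larger than its span, such as `range(10, 20, 20)`, contributes a single value. As far as I know, recent scikit-learn versions also cross-validate each combination with 5 folds by default, so the number of model fits is a multiple of the count printed here.

```python
from itertools import product

# A grid shaped like the one above, written with plain Python values
parameters = {
    'n_estimators': range(10, 20, 20),
    'max_depth': range(10, 20, 20),
    'min_samples_split': range(2, 20, 1),
    'max_features': ['sqrt', 'log2'],
}

def expand_grid(grid):
    """Return every parameter combination in the grid as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, combo))
            for combo in product(*(grid[n] for n in names))]

combos = expand_grid(parameters)
for name, values in parameters.items():
    print(name, '->', len(list(values)), 'value(s)')
print('total combinations:', len(combos))
```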
Model Minimizing
Now that we have good results with a tuned model, we may want to simplify the model. If we can simplify the model without significantly impacting accuracy, that's good for many reasons.
- The model is easier to understand
- Fewer variables means fewer data quality issues and more focused data quality efforts
- It takes fewer resources and less time to run
def rf_feat_importance(m, df):
    return pd.DataFrame({'cols':df.columns, 'imp':m.feature_importances_}
                       ).sort_values('imp', ascending=False)
fi = rf_feat_importance(m, xs)
fi[:5]
Alright so we see that the most important variable is how much the passenger paid for their fare. Lovely.
def plot_fi(fi):
    return fi.plot('cols', 'imp', 'barh', figsize=(12,7), legend=False)
plot_fi(fi[:30]);
to_keep = fi[fi.imp>0.045].cols
len(to_keep)
xs_imp = xs[to_keep]
valid_xs_imp = valid_xs[to_keep]
clf = GridSearchCV(RandomForestClassifier(), parameters, n_jobs=-1)
clf.fit(xs_imp,y)
Results
Now we see that with only 8 features we still get pretty good results on the validation set.
Now the question is whether this small loss in accuracy is outweighed by a simpler and more efficient model. That is a business question more than it is a data science question.
If you are detecting COVID-19, you probably want it to be as accurate as possible. If you are going to predict whether someone is a cat or a dog person based on a survey for marketing purposes, small changes in accuracy probably are not as critical.
confusion_matrix(y,clf.best_estimator_.predict(xs_imp))
confusion_matrix(valid_y,clf.best_estimator_.predict(valid_xs_imp))
clf.best_estimator_
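import scipy.stats
import matplotlib.pyplot as plt

def cluster_columns(df, figsize=(10,6), font_size=12):
    # Spearman rank correlation between every pair of columns
    corr = np.round(scipy.stats.spearmanr(df).correlation, 4)
    # Turn correlation into a distance (1 - corr), in the condensed form linkage expects
    corr_condensed = hc.distance.squareform(1-corr)
    z = hc.linkage(corr_condensed, method='average')
    fig = plt.figure(figsize=figsize)
    hc.dendrogram(z, labels=df.columns, orientation='left', leaf_font_size=font_size)
    plt.show()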
cluster_columns(xs_imp)
From the chart above, we can clearly see that class and pclass are completely redundant. We also see that sex and adult_male have some redundancy. This makes sense, as part of the adult_male column is the sex. Let's go ahead and drop one of the class or pclass variables (they are redundant, so it doesn't matter which).
xs_imp = xs_imp.drop('pclass', axis=1)
valid_xs_imp = valid_xs_imp.drop('pclass', axis=1)
xs_imp.head()
clf.fit(xs_imp,y)
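The clustering chart is built from a distance of 1 minus the Spearman rank correlation, so two columns that always rank the rows identically end up at distance zero. Here is a small standard-library sketch of that calculation, meant only to show why class and pclass come out as redundant. The helper names and the sample rows are assumptions, and the real chart uses `scipy.stats.spearmanr` rather than this code.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def spearman(a, b):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))

# Hypothetical rows: pclass as integers, class as category codes for First/Second/Third
pclass = [3, 1, 3, 1, 3, 2, 2, 3]
class_codes = [3, 1, 3, 1, 3, 2, 2, 3]
fare = [7.25, 71.28, 7.92, 53.10, 8.05, 13.00, 26.00, 21.08]

print('pclass vs class distance:', 1 - spearman(pclass, class_codes))
print('pclass vs fare distance: ', 1 - spearman(pclass, fare))
```

In this toy data, the first distance is zero up to floating point rounding, which is what "completely redundant" means here. The second is large because fare falls as the class number rises.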
Ok, so now on to variables that are not completely redundant. Let's experiment with removing some columns and see what we get. We will use accuracy for our metric.
Here is our baseline:
cm = confusion_matrix(valid_y,clf.best_estimator_.predict(valid_xs_imp))
print("accuracy: ", (cm[0,0] + cm[1,1]) / cm.sum())
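The accuracy computed above is just the diagonal of the confusion matrix divided by the total. Since accuracy alone does not say which kind of mistake the model makes, it can be useful to read precision and recall off the same matrix. This is a minimal sketch assuming the row and column layout scikit-learn's `confusion_matrix` uses for two classes; the function name `summarize` and the counts are made up for illustration and are not results from this article.

```python
def summarize(cm):
    """Accuracy, precision and recall from a 2x2 confusion matrix.

    Rows are true labels and columns are predicted labels:
    [[tn, fp], [fn, tp]].
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        'accuracy': (tn + tp) / total,   # share of all predictions that were right
        'precision': tp / (tp + fp),     # of the predicted positives, how many were right
        'recall': tp / (tp + fn),        # of the actual positives, how many were found
    }

# Made-up counts for illustration only
example_cm = [[100, 10], [20, 50]]
metrics = summarize(example_cm)
for name, value in metrics.items():
    print(f'{name}: {value:.3f}')
```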
def get_accuracy(x,y,valid_x,valid_y):
    # Same settings as the tuned model; parameters not listed keep scikit-learn's defaults
    m = RandomForestClassifier(max_depth=10, max_features='sqrt',
                               min_samples_split=4, n_estimators=10)
    # Fit and score on the frames passed in, not the global xs_imp / valid_xs_imp,
    # so that a dropped column is actually left out of the model
    m.fit(x,y)
    cm = confusion_matrix(valid_y,m.predict(valid_x))
    print((cm[0,0] + cm[1,1]) / cm.sum())
We will now loop through each of the remaining variables and train a model and print out the accuracy score.
Judging by the scores below, removing any one variable does not significantly reduce the accuracy. This means that we have redundant columns that can likely be trimmed. Sex seems to be a column we would definitely keep, as removing it has the most impact on accuracy.
From this we can remove variables and iterate through to continue simplifying as much as possible.
variables = list(xs_imp.columns)
for variable in variables:
    print('drop '+variable+' accuracy:')
    get_accuracy(xs_imp.drop(variable, axis=1),
                 y,
                 valid_xs_imp.drop(variable, axis=1),
                 valid_y)
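Because the forest is trained with no fixed `random_state`, two runs of the loop above can give slightly different scores, which makes small differences between columns hard to interpret. One way to steady the comparison is to repeat each drop several times and average. The sketch below shows the idea with the standard library only; `drop_one_ablation`, `score_fn`, and the stand-in scorer `fake_score` are hypothetical names, and the numbers they produce are invented so the example runs on its own.

```python
def drop_one_ablation(columns, score_fn, repeats=5):
    """Score the model with each column left out, averaged over several runs.

    score_fn(kept_columns) is a caller-supplied function that trains a model
    on the given columns and returns its validation accuracy.
    """
    results = {}
    for col in columns:
        kept = [c for c in columns if c != col]
        scores = [score_fn(kept) for _ in range(repeats)]
        results[col] = sum(scores) / len(scores)
    # Lowest average first: those are the columns the model misses most
    return sorted(results.items(), key=lambda kv: kv[1])

# Stand-in scorer so the sketch is runnable; a real one would fit a random forest
def fake_score(kept_columns):
    return 0.70 + 0.10 * ('sex' in kept_columns) + 0.01 * len(kept_columns)

ranked = drop_one_ablation(['sex', 'fare', 'age'], fake_score)
for col, acc in ranked:
    print(f'drop {col}: {acc:.3f}')
```

In practice `score_fn` would wrap the training and scoring steps used in the loop above, returning the accuracy instead of printing it.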