Intro

In this post we are going to build an image classifier that takes pictures of pets (dogs and cats) and predicts the breed. This is considered 'fine-grained' classification because the differences between classes are fairly small. Classifying between breeds of dogs is fine-grained; classifying whether something is a dog or a cup is not.

To do this we are going to use fastai v2 (imported as fastai2), which is the new version of fastai that will come out in July. The purpose of this post is to introduce a few key concepts that will be useful as we move into harder problems.

Setup

Library Import and Dataset Download

from fastai2.vision.all import *

seed = 42

# Download the dataset and get its path
path = untar_data(URLs.PETS)  # Sample dataset from fastai2
path.ls()
(#2) [Path('/home/ubuntu/.fastai/data/oxford-iiit-pet/images'),Path('/home/ubuntu/.fastai/data/oxford-iiit-pet/annotations')]
img = (path/'images').ls()[0]
img
Path('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/British_Shorthair_154.jpg')

Data Setup

DataBlocks and DataLoaders are convenient ways that fastai helps us manage and load data. There is a lot going on in a DataBlock, so I am going to break it down piece by piece.

DataBlock

  • blocks: What is our data? x is our images (ImageBlock) and y is our categories (CategoryBlock). In this case each image will get a pet breed as the category.

  • get_items: How do we get our data (x)? We use the predefined get_image_files for this, though we can give it something custom if needed.

  • splitter: How should we create the validation set? The splitter splits our data into training and validation sets. The default 20% validation set is just fine, but we define a seed so it is reproducible.

  • get_y: How do we get our dependent variable (y)? In this case we are going to get it from the file name. With using_attr, we can apply a function to an attribute of the file (name). So here we are using a regex on the file name to get y.

  • item_tfms: What transformations do we need to do before packing things up to be sent to the GPU? In this case, resizing each image to 460x460 pixels.

  • batch_tfms: What transformations do we want to do in batches on the GPU? There are many default transforms, and we are specifying a few ourselves (min_scale and size).
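
To make the splitter bullet a bit more concrete, here is a minimal sketch of a seeded random split in plain Python. It is not fastai's RandomSplitter: the helper name random_split is made up for illustration, and fastai's own shuffling will pick different indices. It only shows why fixing the seed makes the split reproducible.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=None):
    """Hypothetical helper: shuffle indices and hold out valid_pct for validation."""
    rng = random.Random(seed)        # private RNG, so the seed fully determines the shuffle
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    cut = int(valid_pct * n_items)   # number of items that go to the validation set
    return idxs[cut:], idxs[:cut]    # (training indices, validation indices)

train_idxs, valid_idxs = random_split(100, valid_pct=0.2, seed=42)
again_train, again_valid = random_split(100, valid_pct=0.2, seed=42)

print(len(train_idxs), len(valid_idxs))
print(valid_idxs == again_valid)     # same seed, same split
```

Running the split twice with the same seed gives the same validation set, which is what lets us compare models trained in different runs.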

pets = DataBlock(
    blocks = (ImageBlock, CategoryBlock),
    get_items = get_image_files,
    splitter= RandomSplitter(valid_pct = 0.2, seed=seed),
    get_y= using_attr(RegexLabeller(r'(.+)_\d+\.jpg$'),'name'),
    item_tfms=Resize(460),
    batch_tfms=aug_transforms(min_scale = 0.9,size=224)
    )
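
To see what the get_y regex does on its own, here is a small sketch using only Python's re module rather than RegexLabeller itself. The helper name breed_from_path and the second file name are made up for illustration; the first path is the one we printed earlier.

```python
import re
from pathlib import Path

# Same idea as the get_y pattern: capture everything before the final "_<digits>.jpg"
pattern = re.compile(r'(.+)_\d+\.jpg$')

def breed_from_path(path):
    """Hypothetical helper: pull the breed label out of a file path's name."""
    match = pattern.match(Path(path).name)
    if match is None:
        raise ValueError(f"Unexpected file name: {path}")
    return match.group(1)

examples = [
    '/home/ubuntu/.fastai/data/oxford-iiit-pet/images/British_Shorthair_154.jpg',
    'images/american_pit_bull_terrier_12.jpg',  # made-up file name for illustration
]
labels = [breed_from_path(p) for p in examples]
print(labels)
```

Because the capture group is greedy, breed names that contain underscores are kept whole and only the trailing number and extension are dropped.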

dataloader

The dataloader is what we will actually interact with. In the DataBlock we defined lots of things we need to do to get and transform images, but not where to get them from. We define that in the dataloader.

dls = pets.dataloaders(path/"images")
/home/ubuntu/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py:2854: UserWarning: The default behavior for interpolate/upsample with float scale_factor will change in 1.6.0 to align with other frameworks/libraries, and use scale_factor directly, instead of relying on the computed output size. If you wish to keep the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details. 
  warnings.warn("The default behavior for interpolate/upsample with float scale_factor will change "
dls.show_batch()

Training the Model

Get a basic model working

In 2 lines of code I am going to create and train a basic model. There are a few things to note:

  1. I am using the dls from the previous step. This is where we defined how to load the data, how to label it, data augmentation, training/validation split, etc.
  2. I can also pass standard architectures. "Resnet" is a common architecture for image classification. "34" signifies that it has 34 layers. If you wish to understand what a layer is, please check out the Image Classifier Basics blog posts that build a simple 1 layer net.
  3. I set a metric, but use the default loss function.
    Note: Loss is what the model trains on. Error rate is just for our reference. Accuracy and error rate make very poor choices for a loss function because their slope is either 0 or, at the exact points where a prediction flips, infinite. A small change in the weights usually doesn't flip any prediction, so the gradient (derivative) gives the model nothing meaningful to follow. This is a prime example of why functions that are good for computers to understand what’s going on and functions that are good for people to understand what’s going on can be very different things.
learn = cnn_learner(dls, resnet34, metrics=error_rate)
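
The note about loss versus metrics can be illustrated with a toy example. This is a minimal sketch in plain Python, not anything fastai does internally: the four data points, the single weight w, and the function names are all made up for illustration. It nudges the weight slightly and compares how accuracy and cross-entropy loss respond.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Toy 1-D dataset: negative inputs are class 0, positive inputs are class 1
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

def accuracy(w):
    preds = [1 if sigmoid(w * x) > 0.5 else 0 for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

def cross_entropy(w):
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(ys)

w, eps = 0.5, 1e-3
acc_change = accuracy(w + eps) - accuracy(w)             # no prediction flips
loss_change = cross_entropy(w + eps) - cross_entropy(w)  # loss still moves
print(acc_change, loss_change)
```

Accuracy does not move at all for the small nudge, so it gives no hint about which direction to adjust w, while the loss shifts slightly and does.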

Next we are going to fine tune the model.

If we were starting from scratch when training a model, we would train every layer. Fine tuning is about training the final layer(s) and leaving the rest largely intact. The earlier layers start from weights obtained via transfer learning. What this means is that a model was previously trained to detect and classify a bunch of different objects. The earlier layers of the neural network detect things that are common to lots of images (e.g. circles, edges, corners, etc.). These generally don't need to change much. The last layer predicts the specific thing, in our case pet breeds. This is what we need to change.

Note: A fun thought experiment for understanding why transfer learning works is to think about the basic elements you need to identify just about everything, the ones you take for granted, and try imagining the world and the objects around you if you were missing some of those basic concepts. For example, what if you did not have the ability to tell the difference between a surface and an edge? Or what if you couldn’t tell the difference between something that is straight and something that is curved? Or what if circular, square, and triangular objects all looked the same to you? What if you could not recognize any pattern - what would you think of a tile floor if you had no ability to comprehend that the tiles form a pattern? How could you function if you couldn’t see corners?
learn.fine_tune(3)
epoch train_loss valid_loss error_rate time
0 1.519997 0.311546 0.106901 01:05
epoch train_loss valid_loss error_rate time
0 0.472248 0.341488 0.105548 01:23
1 0.374747 0.241689 0.076455 01:22
2 0.230108 0.202105 0.066306 01:22
learn.recorder.plot_loss()

We see our validation loss improves significantly along with our error rate. We will use this error rate as our baseline.

Note: A common misconception is that you are overfitting whenever training loss is lower than validation loss. This is not the case. You are not overfitting unless your validation scores get worse. In a well-tuned model, training loss will almost always be lower than validation loss.

Let's take a look at our model, then see if we can improve it!

Look at results

First let's look at some pictures. I think it's always good to actually look at some of the predictions the model is making.

learn.show_results()

Next, we will take a look at some more specific data. Let's start with a high-level confusion matrix. Confusion matrices are used a lot in classification, but once we have this many classes it is really hard to get anything meaningful out of one. There are just too many classes.

interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(figsize=(12,12), dpi=60)

Instead, we look at top losses to see where our model was most wrong. We also look at the most_confused function, which prints the classes the model confuses most often.

interp.plot_top_losses(9, figsize=(15,10))
interp.most_confused(min_val=3)
[('staffordshire_bull_terrier', 'american_pit_bull_terrier', 5),
 ('British_Shorthair', 'Russian_Blue', 4),
 ('Ragdoll', 'Birman', 4),
 ('beagle', 'basset_hound', 4),
 ('Birman', 'Ragdoll', 3),
 ('Egyptian_Mau', 'Bengal', 3),
 ('american_pit_bull_terrier', 'american_bulldog', 3),
 ('havanese', 'yorkshire_terrier', 3)]
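
As a rough idea of what a list like this is counting, here is a sketch in plain Python that tallies (actual, predicted) mistakes and keeps the frequent ones. It is not fastai's implementation: the helper name most_confused_pairs and the toy predictions are made up for illustration.

```python
from collections import Counter

def most_confused_pairs(pairs, min_val=1):
    """Hypothetical helper: count (actual, predicted) mistakes, most frequent first."""
    mistakes = Counter((actual, pred) for actual, pred in pairs if actual != pred)
    return [(a, p, n) for (a, p), n in mistakes.most_common() if n >= min_val]

# Toy (actual, predicted) pairs, not real model output
pairs = (
    [('beagle', 'basset_hound')] * 4
    + [('Ragdoll', 'Birman')] * 3
    + [('Birman', 'Ragdoll')] * 1
    + [('beagle', 'beagle')] * 10   # correct predictions are ignored
)

confused = most_confused_pairs(pairs, min_val=3)
print(confused)
```

Correct predictions sit on the diagonal of the confusion matrix and are skipped, so what is left is a ranked list of the off-diagonal cells.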

Make a Better Model

Now that we have a baseline using the defaults, let's see what we can do to improve it. We will talk about a few main topics.

  • Freezing and Unfreezing for training
  • Learning Rate Finder
  • Discriminative Learning Rates
  • Architecture

Learning Rate Finder

The Learning Rate Finder is very important because setting a good learning rate can make or break a model. We will use it multiple times, and it will come up in essentially every deep learning project. It is good to spend some time understanding what it is showing and experimenting with it.

The learning rate finder (lr_find) gives us 3 things:

  • lr_min: This is based on the learning rate that gives us the minimum loss in our graph. One common rule of thumb is to divide that learning rate by 10 and use the result as your learning rate. Note that the value printed below is labeled "Minimum/10", so the division appears to be already applied to the lr_min we get back.
  • lr_steep: This is the learning rate at the steepest point on our graph, where the loss is falling fastest. Another rule of thumb is to make this your learning rate. Interestingly enough, these 2 rules of thumb often end up with very similar results.
  • graph: The graph is really what I use when determining a learning rate. At the beginning of the graph we see very little reduction in loss. At the end of the graph we see loss spiking. Obviously neither of those is good. In reality we want our learning rate to be somewhere in the middle-ish of that steep decline. This is in line with our 2 rules of thumb.
learn = cnn_learner(dls, resnet34, metrics=error_rate)
lr_min,lr_steep = learn.lr_find()
print(f"Minimum/10: {lr_min:.2e}, steepest point: {lr_steep:.2e}")
Minimum/10: 1.00e-02, steepest point: 3.63e-03
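
The two rules of thumb can be sketched by hand on a recorded curve. This is a simplified illustration in plain Python, not the code behind lr_find: the helper name suggest_lrs and the (learning rate, loss) points are made up, and a real implementation may smooth the losses or trim the ends of the curve.

```python
import math

def suggest_lrs(lrs, losses):
    """Hypothetical helper: apply the two rules of thumb to recorded (lr, loss) points."""
    # Rule 1: learning rate at the minimum loss, divided by 10
    i_min = min(range(len(losses)), key=losses.__getitem__)
    lr_min_over_10 = lrs[i_min] / 10
    # Rule 2: learning rate where the loss falls fastest per unit of log(lr)
    slopes = [
        (losses[i + 1] - losses[i]) / (math.log(lrs[i + 1]) - math.log(lrs[i]))
        for i in range(len(lrs) - 1)
    ]
    i_steep = min(range(len(slopes)), key=slopes.__getitem__)
    return lr_min_over_10, lrs[i_steep]

# Made-up curve: flat at first, a steep decline, then the loss spikes
lrs = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
losses = [4.0, 3.9, 3.5, 2.0, 1.0, 1.5, 6.0]

lr_min_over_10, lr_steep = suggest_lrs(lrs, losses)
print(f"Minimum/10: {lr_min_over_10:.2e}, steepest point: {lr_steep:.2e}")
```

On this toy curve both suggestions land inside the steep decline, which matches the advice to pick something in the middle-ish of that slope.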

Now that we have our graph, let's train our model with this learning rate.

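What's the difference between fine_tune and fit_one_cycle? fit_one_cycle runs one training schedule on whichever layers are currently unfrozen. fine_tune wraps two fit_one_cycle calls: one with the early layers frozen, then one with the whole model unfrozen (see the fine_tune source at the end of this post). Since we just created a fresh learner above, the early layers start out frozen, so the call here only trains the final layers.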

learn.fit_one_cycle(3, 3e-3)
epoch train_loss valid_loss error_rate time
0 1.108581 0.351311 0.108931 01:03
1 0.510538 0.250211 0.078484 01:04
2 0.329532 0.207013 0.063599 01:04

Unfreezing

Unfreezing a model is about training the full model. So far, with the early layers frozen, we have only been training the last layers. This is a great start, and we want to train the last layers more than the early layers, but we still want to train the early layers. Unfreeze allows us to do this. Once we unfreeze the model and get ready to train it some more, we need to run the learning rate finder again to pick a good learning rate.

learn.unfreeze()
learn.lr_find()
SuggestedLRs(lr_min=1.584893179824576e-05, lr_steep=5.754399353463668e-06)

Discriminative Learning Rates

Discriminative learning rates means that we are going to use different learning rates for different layers. Until now we have really only been training the final layers of the model. We are now ready to update all the weights and biases, including the ones that were set by transfer learning. While we do want to train the whole model, we don't want to train all of it at the same speed. We want more changes at the end (i.e. figuring out exactly which breed it is) and fewer changes in the early layers (i.e. identifying lines). This isn't so different from how people work. Learning a new thing (e.g. a type of dog breed) is much easier than improving my fundamental understanding of the world around me.

We will fit this now for 6 epochs. The first layers will have 1e-6 learning rate. The final layers will have 1e-4. Middle layers will be between those 2 numbers. We can see we get down to just under a 6% error rate. 94% accuracy - not bad!

learn.fit_one_cycle(6, lr_max=slice(1e-6,1e-4))
epoch train_loss valid_loss error_rate time
0 0.260102 0.199174 0.064953 01:22
1 0.243101 0.191330 0.062246 01:23
2 0.211765 0.187266 0.059540 01:22
3 0.188451 0.184359 0.063599 01:22
4 0.181025 0.180899 0.060217 01:22
5 0.175231 0.181373 0.059540 01:22
learn.recorder.plot_loss()
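
To picture how "middle layers will be between those 2 numbers" could work for slice(1e-6,1e-4), here is a small sketch in plain Python. My understanding is that the rates are spread evenly on a log scale (multiplicatively) rather than linearly, but treat that, the helper name spread_lrs, and the choice of 3 layer groups as assumptions made for illustration.

```python
def spread_lrs(start, stop, n_groups):
    """Hypothetical helper: n_groups learning rates spaced evenly on a log scale."""
    if n_groups == 1:
        return [stop]
    step = (stop / start) ** (1 / (n_groups - 1))  # constant multiplier between groups
    return [start * step ** i for i in range(n_groups)]

# 3 layer groups is an assumption; the real number depends on how the model is split
group_lrs = spread_lrs(1e-6, 1e-4, 3)
for i, lr in enumerate(group_lrs):
    print(f"group {i}: {lr:.1e}")
```

With 3 groups the middle group ends up around 1e-5, ten times more than the earliest layers and ten times less than the final ones.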

Architecture

Another simple lever is to increase the number of layers in the neural network. The more layers, the higher the capacity to learn from data. This also means a higher capacity to overfit, so more layers does not always mean better!

Note: Overfitting is when continued training increases your validation loss. Many people feel that overfitting is when your training loss is less than your validation loss, but in reality almost every well-tuned model will have a lower training loss than validation loss. If the prediction accuracy on things the model hasn’t seen keeps getting better, how can you be overfitting?

Let's see what the impact can be from using a different architecture. A few comments:
  • We were using resnet34 before, and now we are using resnet101. The number represents the number of layers.
  • We added to_fp16. We are actually decreasing the precision of our calculations a bit for 2 reasons:
    • Faster to train
    • Helps combat overfitting
  • We are using fine_tune with freeze_epochs. All we are doing is training with the earlier layers frozen for 3 epochs, then unfreezing, then training for 6 more epochs. Take a look through the code of the fine_tune method at the end and you will see that I am not oversimplifying; that is exactly what it is doing (self.freeze -> self.fit_one_cycle -> divide learning rate -> self.unfreeze -> self.fit_one_cycle).
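
To get a feel for what "decreasing the precision" means in the to_fp16 comment above, here is a tiny sketch that round-trips a number through a 16-bit float using Python's struct module. It does not reproduce what to_fp16 does inside the training loop; the helper name to_half_and_back is made up for illustration.

```python
import struct

def to_half_and_back(x):
    """Round-trip a Python float through IEEE 754 half precision (16 bits)."""
    return struct.unpack('e', struct.pack('e', x))[0]

full = 0.1
half = to_half_and_back(full)
rounding_error = abs(half - full)
print(full, half, rounding_error)
```

The 16-bit version of 0.1 is close but not identical, so every value carries a little rounding noise in exchange for using half the memory per number.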

With resnet101 we are down to less than a 5% error rate. Even better!

learn = cnn_learner(dls, resnet101, metrics=error_rate).to_fp16()
learn.fine_tune(6, freeze_epochs=3)
epoch train_loss valid_loss error_rate time
0 1.371693 0.247931 0.079161 02:43
1 0.551731 0.246441 0.077131 02:39
2 0.371089 0.233275 0.080514 02:39
epoch train_loss valid_loss error_rate time
0 0.258692 0.246271 0.081191 03:36
1 0.310040 0.321208 0.091340 03:33
2 0.266156 0.289912 0.081867 03:34
3 0.149828 0.213094 0.060893 03:34
4 0.082449 0.183762 0.056834 03:35
5 0.048759 0.173572 0.049391 03:34
learn.recorder.plot_loss()
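
The resnet101 numbers above make a nice check on the overfitting note. This short sketch copies the losses from the six unfrozen epochs in the table and asks two separate questions: is training loss below validation loss, and is validation loss still at its best in the final epoch? The variable names are just for illustration.

```python
# Losses copied from the six unfrozen resnet101 epochs in the table above
train_loss = [0.258692, 0.310040, 0.266156, 0.149828, 0.082449, 0.048759]
valid_loss = [0.246271, 0.321208, 0.289912, 0.213094, 0.183762, 0.173572]

# Epoch with the lowest validation loss
best_epoch = min(range(len(valid_loss)), key=valid_loss.__getitem__)

train_below_valid = train_loss[-1] < valid_loss[-1]
valid_best_at_end = best_epoch == len(valid_loss) - 1
print(best_epoch, train_below_valid, valid_best_at_end)
```

Training loss ends well below validation loss, yet the last epoch is also the best one on the validation set, so by the definition used in this post the model is not overfitting yet.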
??learn.fine_tune
Signature:
learn.fine_tune(
    epochs,
    base_lr=0.002,
    freeze_epochs=1,
    lr_mult=100,
    pct_start=0.3,
    div=5.0,
    lr_max=None,
    div_final=100000.0,
    wd=None,
    moms=None,
    cbs=None,
    reset_opt=False,
)
Source:   
@patch
@log_args(but_as=Learner.fit)
@delegates(Learner.fit_one_cycle)
def fine_tune(self:Learner, epochs, base_lr=2e-3, freeze_epochs=1, lr_mult=100,
              pct_start=0.3, div=5.0, **kwargs):
    "Fine tune with `freeze` for `freeze_epochs` then with `unfreeze` from `epochs` using discriminative LR"
    self.freeze()
    self.fit_one_cycle(freeze_epochs, slice(base_lr), pct_start=0.99, **kwargs)
    base_lr /= 2
    self.unfreeze()
    self.fit_one_cycle(epochs, slice(base_lr/lr_mult, base_lr), pct_start=pct_start, div=div, **kwargs)
File:      ~/anaconda3/lib/python3.7/site-packages/fastai2/callback/schedule.py
Type:      method
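
Reading the source above, the learning rates used in the two phases can be worked out by hand. This sketch only repeats the arithmetic from the fine_tune body with its default arguments; the helper name fine_tune_lrs is made up, and nothing is actually trained.

```python
def fine_tune_lrs(base_lr=2e-3, lr_mult=100):
    """Hypothetical helper: learning rates implied by the fine_tune source above."""
    frozen_lr = base_lr                             # slice(base_lr) while frozen
    base_lr = base_lr / 2                           # base_lr /= 2
    unfrozen_range = (base_lr / lr_mult, base_lr)   # slice(base_lr/lr_mult, base_lr)
    return frozen_lr, unfrozen_range

frozen_lr, (lowest_lr, highest_lr) = fine_tune_lrs()
print(f"frozen phase:   {frozen_lr:.1e}")
print(f"unfrozen phase: {lowest_lr:.1e} to {highest_lr:.1e}")
```

So with the defaults, the unfrozen phase is itself using discriminative learning rates, with the earliest layers trained 100 times more gently than the final ones.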