Fine Grain Image Classification
Creating a fine grain image classifier - Pet Breeds!
Intro
In this blog we are going to build an image classifier that takes dog pictures and predicts the breed. This is considered 'fine grain' because the difference between classes is fairly minimal. Classifying between breeds of dogs is fine grain; classifying whether something is a dog or a cup is not.
To do this we are going to use fastai2, the new version of fastai that will come out in July. The purpose of this post is to introduce a few key concepts that will be useful as we move into harder problems.
from fastai2.vision.all import *
seed = 42
# Download and get path for dataset
path = untar_data(URLs.PETS) #Sample dataset from fastai2
path.ls()
img = (path/'images').ls()[0]
img
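Note that img here is just a file path, so displaying it only prints the path. To peek at the actual picture, fastai2's PILImage can open it (assuming the wildcard import above brings PILImage into scope, which it normally does):
im = PILImage.create(img)   # open the image file
im.to_thumb(192)            # show a small thumbnail in the notebook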
Data Setup
DataBlocks and DataLoaders are convenient ways that fastai helps us manage and load data. There is a lot going on in DataBlock, so I am going to break it down piece by piece.
DataBlock
- blocks: What is our data? x is our images (ImageBlock) and y is our categories (CategoryBlock). In this case each image will get a dog breed as the category.
- get_items: How do we get our data (x)? We use the predefined get_image_files for this, though we can give it something custom if needed.
- splitter: How should we create the validation set? The splitter splits our data into training and validation sets. The default 20% validation set is just fine, but we define a seed so it is reproducible.
- get_y: How do we get our dependent variable (y)? In this case we are going to get it from the file name. With using_attr, we can apply a function to an attribute of the file (name). So here we are using regex on the file name to get y (there is a quick sanity check of this regex right after the DataBlock below).
- item_tfms: What transformations do we need to do before packing things up to be sent to the GPU? In this case, resizing to 460.
- batch_tfms: What transformations do we want to do in batches on the GPU? There are many default transforms, and we are specifying a few ourselves (min_scale and size).
pets = DataBlock(
blocks = (ImageBlock, CategoryBlock),
get_items = get_image_files,
splitter= RandomSplitter(valid_pct = 0.2, seed=seed),
get_y= using_attr(RegexLabeller(r'(.+)_\d+.jpg$'),'name'),
item_tfms=Resize(460),
batch_tfms=aug_transforms(min_scale = 0.9,size=224)
)
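As a quick sanity check of that get_y regex, here is what it pulls out of a typical file name (the exact file will depend on your download):
import re
fname = (path/'images').ls()[0].name    # e.g. something like 'great_pyrenees_173.jpg'
re.findall(r'(.+)_\d+.jpg$', fname)     # the captured group, e.g. ['great_pyrenees'], becomes the label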
dls = pets.dataloaders(path/"images")
dls.show_batch()
Training the Model
Get a basic model working
In 2 lines of code I am going to create and train a basic model. There are a couple of things to note:
- I am using the dls from the previous step. This is where we defined how to load the data, how to label it, the data augmentation, the training/validation split, etc.
- I can also pass in a standard architecture. ResNet is a common architecture for image classification, and the "34" signifies that it has 34 layers. If you wish to understand what a layer is, please check out the Image Classifier Basics blog posts that build a simple 1-layer net.
- I set a metric, but use the default loss function.
Note: Loss is what the model uses to train on. Error rate is just for our reference. Accuracy and error rate make very poor choices for a loss function because they have either zero or infinite slope, so calculating the gradient/derivative is not meaningful. This is a prime example of why good functions for computers to understand what's going on and good functions for people to understand what's going on can be very different things.
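To make that concrete, here is a tiny plain-PyTorch sketch (the numbers and the error_rate_fn helper are made up for illustration): a small nudge to the model's outputs leaves the error rate untouched but moves the cross-entropy loss smoothly, which is exactly the signal gradient descent needs.
import torch
import torch.nn.functional as F

targets = torch.tensor([0, 1])                    # true classes for two examples
logits  = torch.tensor([[2.0, 1.0], [0.5, 1.5]])  # model outputs (both predictions already correct)
nudged  = logits.clone()
nudged[:, 1] += 0.01                              # a tiny change to one class's scores

def error_rate_fn(lg): return (lg.argmax(dim=1) != targets).float().mean()

print(error_rate_fn(logits), error_rate_fn(nudged))                        # identical: no gradient signal
print(F.cross_entropy(logits, targets), F.cross_entropy(nudged, targets))  # changes smoothly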
learn = cnn_learner(dls, resnet34, metrics=error_rate)
Next we are going to fine tune the model.
If we were training a model from scratch, we would train every layer. Fine tuning is about training the final layer(s) and leaving the rest intact. The previous layers' weights were set via transfer learning: a model was trained to detect and classify a bunch of different objects, and the earlier layers of that network detect things that are common to lots of images (e.g. circles, edges, corners). These generally don't need to change much. The last layer predicts the specific thing, in our case pet breeds, so that is what we need to change.
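To make 'frozen' concrete: under the hood it just means the early layers' parameters are excluded from gradient updates. Here is a minimal plain-PyTorch sketch of the idea (made-up layers, not fastai's actual implementation, which works on parameter groups):
import torch.nn as nn

# a stand-in "pretrained body" and a new "head", just to show the idea
body  = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten())
head  = nn.Linear(16, 37)                  # 37 pet breed classes in this dataset
model = nn.Sequential(body, head)

for p in body.parameters():                # "freezing" = no gradient updates for the body
    p.requires_grad_(False)

print(sum(p.requires_grad for p in model.parameters()))   # only the head's weight and bias are left trainable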
learn.fine_tune(3)
learn.recorder.plot_loss()
We see our validation loss improves significantly along with our error rate. We will use this error rate as our baseline.
learn.show_results()
Next, we will take a look at the results in more detail, starting with a high-level confusion matrix. Many times in classification we look at a confusion matrix, but as you will see, once we have this many classes it is really hard to make anything meaningful out of it; there are just too many classes.
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(figsize=(12,12), dpi=60)
Instead, we look at the top losses to see where our model was most wrong. We also look at the 'most confused' function, which prints the pairs of classes it confuses most often.
interp.plot_top_losses(9, figsize=(15,10))
interp.most_confused(min_val=3)
Make a Better Model
Now that we have a baseline using the defaults, let's see what we can do to improve it. We will talk about a few main topics.
- Freezing and Unfreezing for training
- Learning Rate Finder
- Discriminative learning rates
- Architecture
Learning Rate Finder
The Learning Rate Finder is very important because setting a good learning rate can make or break a model. We will use it multiple times, and it will come up in essentially every deep learning project. It is worth spending some time to understand what it is showing and to experiment.
The learning rate finder (lr_find) gives us 3 things:
- lr_min: This is the learning rate that gives us the minimum loss in our graph. One common rule of thumb is to divide this by 10 and use that as your learning rate.
- lr_steep: This is the steepest point on our graph. Another rule of thumb is to make this your learning rate. Interestingly enough, these two rules of thumb often end up with very similar results.
- graph: The graph is really what I use when determining a learning rate. At the beginning of the graph we see very little reduction in loss. At the end of the graph we see the loss spiking. Obviously neither of those is good. In reality we want our learning rate to be somewhere in the middle of that steep decline, which is in line with our two rules of thumb.
learn = cnn_learner(dls, resnet34, metrics=error_rate)
lr_min,lr_steep = learn.lr_find()
print(f"Minimum/10: {lr_min:.2e}, steepest point: {lr_steep:.2e}")
Now that we have our graph, let's train our model with this learning rate.
What's the difference between fine_tune and fit_one_cycle? fine_tune first trains only the head with the rest of the model frozen, then unfreezes and keeps training the whole model; fit_one_cycle simply trains whatever layers are currently unfrozen using the one-cycle learning rate schedule.
learn.fit_one_cycle(3, 3e-3)
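If you are curious what 'one cycle' looks like in practice, the recorder can plot the schedule it just used. plot_sched is part of fastai2's schedule callbacks, though it is worth confirming it exists in your installed version:
learn.recorder.plot_sched()   # learning rate warms up then anneals, momentum does the opposite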
Unfreezing
Unfreezing a model means training the full model. We spoke earlier about fine_tune only training the last layers. This is a great start, and we want to train the last layers more than the early layers, but we still want to train the early layers some. unfreeze allows us to do this. Now that we have unfrozen the model and are going to train it more, we need to run the learning rate finder again to pick a good learning rate.
learn.unfreeze()
learn.lr_find()
Discriminative Learning Rates
Discriminative Learning Rates means that we are going to use a different learning rate for different layers. Previously we have really only been training the final layers of the model. We are now ready to update all the weights and biases, including the ones that were set by transfer learning. While we do want to train the whole model, we don't want to train it at the same speed everywhere. We want more change at the end (i.e. figuring out exactly which breed it is) and less change in the early layers (i.e. identifying lines). This isn't so different from how people work: learning a new thing (say, a type of dog breed) is much easier than improving my fundamental understanding of the world around me.
We will fit this now for 6 epochs. The first layers will have a 1e-6 learning rate, the final layers 1e-4, and the middle layers will be spread between those two numbers. We can see we get down to just under a 6% error rate. 94% accuracy - not bad!
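fastai takes that slice and spreads it across its parameter groups, roughly on a log scale, with the smallest rate for the earliest group and the largest for the head. A rough illustration of the idea (not fastai's internal code; three groups is the typical split for a cnn_learner):
import numpy as np

n_groups = 3                                   # cnn_learner typically splits the model into 3 parameter groups
print(np.geomspace(1e-6, 1e-4, num=n_groups))  # -> [1.e-06 1.e-05 1.e-04], early layers to head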
learn.fit_one_cycle(6, lr_max=slice(1e-6,1e-4))
learn.recorder.plot_loss()
Architecture
Another simple lever is to increase the number of layers in the neural network. The more layers, the higher the capacity to learn from the data. This also means a higher capacity to overfit, so more layers does not always mean better!
- We were using resnet34 before, and now we are using resnet101. The number represents the number of layers.
- We added to_fp16. We are actually decreasing the precision of our calculations a bit, for two reasons:
  - It is faster to train
  - It helps combat overfitting
- We are using fine_tune with freeze_epochs. All we are doing is training with the earlier layers frozen for 3 epochs, then unfreezing and training for 6 more epochs. Take a look through the code of the fine_tune method at the end and you will see that I am not summarizing; that's exactly what it is doing (self.freeze -> self.fit_one_cycle -> divide learning rate -> self.unfreeze -> self.fit_one_cycle).
With resnet101 we are down to less than a 5% error rate. Even better!
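Before reading the real source with ??learn.fine_tune at the end, here is a simplified sketch of the sequence it follows (illustrative only; fastai's version also tweaks the one-cycle parameters):
def fine_tune_sketch(learn, epochs, freeze_epochs=1, base_lr=2e-3):
    # roughly what fine_tune does; the real method also adjusts the one-cycle settings
    learn.freeze()                                            # train only the head first
    learn.fit_one_cycle(freeze_epochs, slice(base_lr))
    base_lr /= 2                                              # drop the learning rate before full training
    learn.unfreeze()                                          # then train every layer
    learn.fit_one_cycle(epochs, slice(base_lr/100, base_lr))  # discriminative rates: smaller for early layers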
learn = cnn_learner(dls, resnet101, metrics=error_rate).to_fp16()
learn.fine_tune(6, freeze_epochs=3)
learn.recorder.plot_loss()
??learn.fine_tune