After performing dismally in the Kaggle RSNA Intracranial Haemorrhage Competition, thanks to a pig-headed strategy and too little thinking, I resolved to see what the winners had done right. This series of posts covers what I learned looking at the code shared by the 2nd placed team, whose solution I found both approachable and innovative. I hope you enjoy it.
Part 3 of this series will give an overview of the first stage of model development: training the image classifier.
Here is the original code I will walk through in this post
The Jupyter Notebooks that accompany these posts are here
If you need a recap of how the data was prepared, part 2 of the series, covering the Data Preparation strategy, is here
Recap
The goal of the competition was to predict whether a CT scan contained any of 6 different classes: 5 haemorrhage types and 1 “any” class:
'epidural'
'intraparenchymal'
'intraventricular'
'subarachnoid'
'subdural'
'any'
The modelling strategy team NoBrainer used:
Train an image classifier using 5-fold validation
Run inference on all images and extract the global average pool (GAP) activations from the model to use as embeddings
Train an LSTM by running these embeddings through it in sequence (patient by patient)
Make predictions using the LSTM
This post will cover (1) and (2) above while the final post will cover the remaining modelling stages.
Train/Validation Split
The team used 5-fold cross validation, with the folds split by PatientID. Splitting by patient was important: a patient could have between 25 and 60 images in the dataset, meaning that if the data were not split by PatientID, scans from the same patient could appear in both the train and validation sets. See the code here for how the folds were constructed
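The team built their folds in their data preparation code (linked above). Purely as an illustration of a patient-grouped split, here is a minimal sketch using scikit-learn's GroupKFold, with a hypothetical train_metadata.csv and column names (not the team's exact code):

import pandas as pd
from sklearn.model_selection import GroupKFold

# Hypothetical metadata file with one row per image and a PatientID column
df = pd.read_csv("train_metadata.csv")

# Group by PatientID so all images from one patient land in the same fold
gkf = GroupKFold(n_splits=5)
df["fold"] = -1
for fold, (_, val_idx) in enumerate(gkf.split(df, groups=df["PatientID"])):
    df.loc[df.index[val_idx], "fold"] = fold

# Sanity check: no patient should appear in more than one fold
assert df.groupby("PatientID")["fold"].nunique().max() == 1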
My dataset split
For faster experimentation I wanted a dataset of about 100K images, so I created my own train/validation split with a 50/50 positive/negative balance. The steps were as follows (a rough pandas sketch follows the list):
- Select the positives for the train and val sets:
  - Find all positive patients for each label (label == 1)
  - Randomly sample the desired number of patients (each patient had an average of ~40 images) for the train set
  - Split the patients 85% into the train set and 15% into the val set
  - Randomly sample the remaining 15% of patients for the val set from the remainder
  - Select all images corresponding to these patients for the train and validation sets
  - Remove any PatientIDs that overlap between the train and val sets from the val set; these crept in because some patients have multiple different haemorrhage types
  - Remove any negative images from both sets; because a patient's images form a sequence, the first and last images are often negative, so some negatives crept in when all of a patient's images were selected
- Select the negatives for the train and val sets:
  - Randomly select negative patients (no label == 1)
  - Split the patients 85/15
  - Select images from these patients to match the image count in the positive set
- Merge the negatives dataframe with the positives dataframe
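As a rough illustration of the steps above, here is a minimal pandas sketch. It assumes a metadata dataframe with one row per image, a PatientID column and an 'any' label column, samples on the 'any' label rather than per haemorrhage type, and uses an illustrative patient count, so it is not the exact code I used:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical metadata: one row per image, a PatientID column and one binary column per label
df = pd.read_csv("train_metadata.csv")

# Positive patients: those with at least one image labelled 'any' == 1
pos_patients = df.loc[df["any"] == 1, "PatientID"].unique()
pos_sample = rng.choice(pos_patients, size=1250, replace=False)  # ~1250 patients * ~40 images ≈ 50K

# 85/15 patient-level split for the positives, keeping only their positive images
n_train = int(0.85 * len(pos_sample))
train_pats, val_pats = pos_sample[:n_train], pos_sample[n_train:]
train_pos = df[df["PatientID"].isin(train_pats) & (df["any"] == 1)]
val_pos = df[df["PatientID"].isin(val_pats) & (df["any"] == 1)]

# Negative patients: never positive on any image, also split at the patient level
patient_max = df.groupby("PatientID")["any"].max()
neg_patients = rng.permutation(patient_max[patient_max == 0].index.to_numpy())
n_neg_train = int(0.85 * len(neg_patients))
train_neg = df[df["PatientID"].isin(neg_patients[:n_neg_train])].iloc[: len(train_pos)]
val_neg = df[df["PatientID"].isin(neg_patients[n_neg_train:])].iloc[: len(val_pos)]

# Final 50/50 positive/negative train and validation sets
train_df = pd.concat([train_pos, train_neg]).sample(frac=1, random_state=42)
val_df = pd.concat([val_pos, val_neg]).sample(frac=1, random_state=42)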
Image Cropping - AutoCrop
Cropping the images should help the model by making sure it focuses only on the most important region of the image. The autocrop function cuts “any black space back to edges of where non-black space begins; although keep the square aspect ratio.”
import numpy as np

def autocrop(image, threshold=0):
    '''Crops any edges below or equal to threshold.
    Returns a square, cropped image.
    https://stackoverflow.com/questions/13538748/crop-black-edges-with-opencv
    '''
    if len(image.shape) == 3:
        # Flatten to a single channel by taking the max across the channels
        flatImage = np.max(image, 2)
    else:
        flatImage = image
    # EXPLAINER:
    # - np.max(flatImage, 0) takes the max down each column (one value per column)
    # - np.where(... > threshold) returns the column indices whose max is above
    #   the threshold (zero here), i.e. the columns that aren't entirely black
    # (Note the variable names are swapped relative to what they hold: `rows`
    #  contains column indices and `cols` contains row indices, but the indexing
    #  below is consistent with this, so the crop is correct.)
    rows = np.where(np.max(flatImage, 0) > threshold)[0]
    cols = np.where(np.max(flatImage, 1) > threshold)[0]
    # Crop the image to the first and last non-black rows and columns
    image = image[cols[0]: cols[-1] + 1, rows[0]: rows[-1] + 1]  # e.g. image.shape is now (357, 399, 3)
    sqside = max(image.shape)  # e.g. 399
    # Create a new square, black, 3-channel image, e.g. of shape (399, 399, 3)
    imageout = np.zeros((sqside, sqside, 3), dtype='uint8')
    # Copy the pixels from the cropped image into the new square image
    imageout[:image.shape[0], :image.shape[1], :] = image.copy()  # e.g. imageout.shape = (399, 399, 3)
    return imageout
Alternative cropping method
An alternative, more thorough cropping technique that I would highly recommend is the method used in the fastai v2 library and demonstrated in Jeremy Howard's “Cleaning the data for Rapid Prototyping” notebook. As the example below shows, it can do more than just crop black edges, handling non-head artefacts in the image much more precisely.
Bonus Cropping - autocropmin
autocropmin is a function that the 2nd place team coded up but, in the heat of competition, forgot to add to their training script. It should be better at removing artefacts around the images, such as the curved scanner lines above. Even though it performs better than autocrop, I would still recommend checking out what the fastai v2 library can do for masking instead. See my Jupyter notebook for the autocropmin code
Data Augmentation
Core to any successful model training is a strong data augmentation policy. The team used the transforms below from the Albumentations library:
HorizontalFlip(p=0.5)
ShiftScaleRotate(shift_limit=0.05, scale_limit=0.05, rotate_limit=20, p=0.3, border_mode=cv2.BORDER_REPLICATE)
Transpose(p=0.5)
Normalize(mean=mean_img, std=std_img, max_pixel_value=255.0, p=1.0)
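Assembled into an Albumentations pipeline, the training transforms would look roughly like the sketch below; mean_img and std_img here are placeholder values standing in for the dataset statistics used, so treat this as an illustration rather than the exact configuration.

import cv2
import albumentations as A

# Placeholder dataset statistics (the real values were computed from the training images)
mean_img = [0.5, 0.5, 0.5]
std_img = [0.5, 0.5, 0.5]

transform_train = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.05, rotate_limit=20,
                       p=0.3, border_mode=cv2.BORDER_REPLICATE),
    A.Transpose(p=0.5),
    A.Normalize(mean=mean_img, std=std_img, max_pixel_value=255.0, p=1.0),
])

# Albumentations transforms take and return dicts keyed by 'image'
image = cv2.imread('example_scan.jpg')   # hypothetical image path
augmented = transform_train(image=image)
image = augmented['image']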
Custom Dataset Class
The custom dataset class created performs the following (a rough sketch follows this list):
Appends the .jpg suffix to the image name
Reads the image data into an array using OpenCV
Crops the image using autocrop (if the crop flag is set to true)
Resizes the image to the specified size, (480, 480) in this case
Transforms the image according to the transforms given
Adds the labels to the output (if the labels flag is true)
Returns a dictionary with the image array and the labels for that image
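Here is a minimal sketch of such a dataset class, assuming a dataframe with an Image column holding the filename stem and one column per label; it follows the steps above but is not the team's exact implementation.

import os
import cv2
import numpy as np
import torch
from torch.utils.data import Dataset

label_cols = ['epidural', 'intraparenchymal', 'intraventricular',
              'subarachnoid', 'subdural', 'any']

class IntracranialDataset(Dataset):
    def __init__(self, df, path, transform=None, crop=True, labels=True, size=480):
        self.df = df.reset_index(drop=True)
        self.path = path
        self.transform = transform
        self.crop = crop
        self.labels = labels
        self.size = size

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # Append the .jpg suffix and read the image with OpenCV
        img_name = os.path.join(self.path, self.df.loc[idx, 'Image'] + '.jpg')
        img = cv2.imread(img_name)
        if self.crop:
            img = autocrop(img, threshold=0)            # crop away the black borders
        img = cv2.resize(img, (self.size, self.size))
        if self.transform is not None:
            img = self.transform(image=img)['image']    # apply the Albumentations pipeline
        # Channels-first float tensor for PyTorch
        img = torch.tensor(np.transpose(img, (2, 0, 1)), dtype=torch.float32)
        if self.labels:
            labels = torch.tensor(self.df.loc[idx, label_cols].values.astype(np.float32))
            return {'image': img, 'labels': labels}
        return {'image': img}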
Model Training Parameters
Model : ResNeXt-101 32x8d
Image size : (480,480)
Batch size : 18
Epochs : 5
lr : 2e-5
Folds : 5 (used 5-fold validation, but only trained on 3 of the folds)
Optimizer : Adam
Loss : Binary Cross Entropy (BCEWithLogitsLoss)
Mixed precision training was also used via the “Apex” library from Nvidia.
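To give a sense of how these pieces fit together with Apex, here is a rough sketch of the training setup; the opt_level, the train_loader and the loop details are my assumptions rather than the team's exact script (criterion is the weighted loss defined in the Loss Function section below).

import torch
from apex import amp

device = torch.device('cuda')
n_classes = 6

# ResNeXt-101 32x8d WSL backbone with a new 6-class head
model = torch.hub.load('facebookresearch/WSL-Images', 'resnext101_32x8d_wsl')
model.fc = torch.nn.Linear(2048, n_classes)
model = model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

# Mixed precision via Apex AMP ('O1' is the standard mixed-precision opt level; assumed here)
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

for batch in train_loader:                       # train_loader: a DataLoader over the dataset above
    inputs = batch['image'].to(device)
    targets = batch['labels'].to(device)
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)           # weighted BCE, defined further below
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()                   # scaled backward pass for fp16 numerical stability
    optimizer.step()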
Image Classifier - ResNeXt-101 32x8d
Let's have a look at the ResNeXt-101 32x8d model that the 2nd place solution used.
What is a ResNeXt Model?
ResNeXt models were introduced in “Aggregated Residual Transformations for Deep Neural Networks” in 2016. A ResNeXt model essentially adds an additional dimension (the "next" dimension) to a traditional ResNet block:
This next dimension is called the “cardinality” dimension. ResNeXt came 2nd at the 2016 ILSVRC classification task when it was first introduced. From the authors:
"In this paper, we present a simple architecture which adopts VGG/ResNets’ strategy of repeating layers, while exploiting the split-transform-merge strategy in an easy, extensible way. A module in our network performs a set of transformations, each on a low-dimensional embedding, whose outputs are aggregated by summation. We pursuit (sic) a simple realisation of this idea — the transformations to be aggregated are all of the same topology (e.g., Fig. 1 (right)). This design allows us to extend to any large number of transformations without specialised designs."
The authors then simplified this “split-transform-merge” strategy in order for a faster, easier implementation. All 3 solutions below are equivalent, (c) is what the authors chose to implement:
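Implementation (c) boils down to a standard bottleneck block whose 3x3 convolution is a grouped convolution, with groups equal to the cardinality. Below is a simplified PyTorch sketch of one such block (roughly the shape of a first-stage 32x4d block; illustrative only, not the paper's code).

import torch.nn as nn

class ResNeXtBottleneck(nn.Module):
    """Simplified ResNeXt bottleneck: 1x1 reduce -> 3x3 grouped conv -> 1x1 expand."""
    def __init__(self, in_ch=256, bottleneck_ch=128, out_ch=256, cardinality=32):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, bottleneck_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(bottleneck_ch),
            nn.ReLU(inplace=True),
            # The grouped convolution is where the cardinality dimension lives:
            # 32 groups of width 4 (= 128 / 32), transformed independently then concatenated
            nn.Conv2d(bottleneck_ch, bottleneck_ch, kernel_size=3, padding=1,
                      groups=cardinality, bias=False),
            nn.BatchNorm2d(bottleneck_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Residual connection around the aggregated transformations
        return self.relu(x + self.block(x))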
The ResNeXt-101 32x8d WSL Model Used
TL;DR These "WSL" models were trained by Facebook AI Research on 940 million Instagram images in a **weakly supervised** fashion before being fine-tuned on ImageNet.
Naming Convention
This model has 101 layers, a cardinality of 32 (the number of groups) and a bottleneck width of 8d (the width of each group). See the paper above for a more in-depth explanation
Other Model Options?
Hindsight is 20/20, and potentially there were other image classification architectures that could have worked even better here. One (imperfect) way to assess potential model performance is ImageNet top-1 accuracy.
ResNeXt-101 32x8d top-1 ImageNet accuracy
The ResNeXt-101 32x8d model used here had a top-1 accuracy of 82.2% with 88M parameters
Its bigger brother, the ResNeXt-101 32x16d model, had a top-1 accuracy of 84.2% with 193M parameters. That model is a real beast, meaning it wouldn't have been practical for a dataset this large given the limited timeline.
EfficientNet
For comparison:
the EfficientNet-b4 model has a top-1 accuracy of 83.0% with only 19M parameters, 0.8% higher than the ResNeXt-101 32x8d
the EfficientNet-b7 model has a top-1 accuracy of 84.5% with 66M parameters, 2.3% higher while still having fewer parameters
Potentially either of these could have been a viable candidate. The 2nd place winners did state that they trialled the EfficientNet-b0 model, but most likely such a small model just wouldn't have had enough capacity to make the most of this large RSNA dataset.
Assemble-ResNet-50
Mentioning this model is a little facetious, as it hadn't been released while the competition was ongoing. But if it had been, it seems like a great candidate for at least some rapid experimentation. It is essentially a modified ResNet-50 trained with a collection of recent training techniques and data augmentations.
It beats the ResNeXt-101 32x8d top-1 ImageNet result by 0.6%, at **82.8%**, yet is a much smaller model given that it is based on a ResNet-50.
(Source: "Compounding the Performance Improvements of Assembled Techniques in a Convolutional Neural Network")
Having said all that, the only way of knowing would be to actually test all of this :)
Loss Function
Binary Cross Entropy loss was used here. I won't go into depth on it (this article here has a great explainer if you'd like to know more), but I will mention that a Kaggler found out during the competition that, of the 6 classes to predict, the “any” class carried twice the weight of the other 5 classes, so a custom weighted loss function was needed:
def criterion(data, targets, criterion=torch.nn.BCEWithLogitsLoss()):
    '''Custom loss: weighted BCE with the "any" column (the last one) counted twice'''
    loss_all = criterion(data, targets)                   # mean BCE over all 6 columns
    loss_any = criterion(data[:, -1:], targets[:, -1:])   # BCE for the "any" column alone
    # (6 * mean-of-6 + 1 * any) / 7 == weighted mean with "any" weighted 2x
    return (loss_all * 6 + loss_any * 1) / 7
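As a quick sanity check on the weighting, here is a hypothetical call with a batch of raw logits and binary targets of shape (batch_size, 6), where the last column is the “any” label:

import torch

logits = torch.randn(18, 6)                      # raw model outputs for a batch of 18
targets = torch.randint(0, 2, (18, 6)).float()   # binary labels; the last column is 'any'
loss = criterion(logits, targets)
# Equivalent to a per-class weighting of [1, 1, 1, 1, 1, 2] normalised by 7
print(loss.item())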
Extracting the Embeddings
Once the model was trained to a reasonable validation loss, the embeddings were generated by running inference on each image in the dataset. An “Identity” module is set as the model head so that the final-layer activations (the 2048-dimensional global average pool features) can be extracted easily:
class Identity(nn.Module):
    def __init__(self):
        super(Identity, self).__init__()

    def forward(self, x):
        return x
The Identity layer is set as the final fully connected layer (the .module attribute is there because the model is wrapped in a DataParallel-style wrapper in the original training code):
model = torch.hub.load('facebookresearch/WSL-Images', 'resnext101_32x8d_wsl')
model.fc = torch.nn.Linear(2048, n_classes)
# the model is wrapped (e.g. in torch.nn.DataParallel) in between, hence .module below
model.module.fc = Identity()
At its core, the extraction simply comes down to saving the output of the final (now Identity) layer of the model:
embeddings_list = []
out = model(inputs)
embeddings_list.append(out.detach().cpu().numpy())
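In context this sits inside a plain inference loop over a DataLoader, something along these lines (a sketch under my assumptions for val_loader, device and the output path, not the exact extraction script):

import numpy as np
import torch

model.eval()
embeddings_list = []
with torch.no_grad():                            # no gradients needed for feature extraction
    for batch in val_loader:                     # a DataLoader over the dataset class above
        inputs = batch['image'].to(device)
        out = model(inputs)                      # Identity head -> (batch, 2048) GAP embeddings
        embeddings_list.append(out.detach().cpu().numpy())

# One row of 2048 features per image, saved for the LSTM stage
embeddings = np.concatenate(embeddings_list, axis=0)
np.save('embeddings_fold0.npy', embeddings)      # hypothetical output path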
Next Up…
In this post we covered the key components of the image classifier stage of the 2nd place solution. If you'd like to run the above, check out the Jupyter Notebooks that accompany this series here. Next up we'll have a look at the second stage: running the generated embeddings through an LSTM model!
Paper Citations
Assembled-CNN
Jungkyu Lee, Taeryun Won and Kiho Hong. Compounding the Performance Improvements of Assembled Techniques in a Convolutional Neural Network. 2020. arXiv: https://arxiv.org/abs/2001.06268
EfficientNet
Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. ICML 2019. arXiv: https://arxiv.org/abs/1905.11946
ResNeXt
Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu and Kaiming He. Aggregated Residual Transformations for Deep Neural Networks. 2016. arXiv: https://arxiv.org/pdf/1611.05431.pdf
ResNeXt WSL models
Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe and Laurens van der Maaten. Exploring the Limits of Weakly Supervised Pretraining. 2018. arXiv: https://arxiv.org/pdf/1805.00932.pdf