
[3/4][Kaggle] The Image Classifier - 2nd Place Solution Breakdown - RSNA Intracranial Hemorrhage Competition

After performing dismally in the Kaggle RSNA Intracranial Haemorrhage Competition, thanks to a pig-headed strategy and too little thinking, I resolved to see what the winners had done right. This series of posts covers what I learned looking at the code shared by the 2nd place team, whose solution I found both approachable and innovative. I hope you enjoy it.

Part 3 of this series gives an overview of the first stage of model development: training the image classifier.

If you need a recap of how the data was prepared, part 2 of the series covering the Data Preparation strategy is here.

Recap

The goal of the competition was to predict whether a CT scan contained any of 6 different classes: 5 haemorrhage types and 1 “any” class:

  • 'epidural'

  • 'intraparenchymal'

  • 'intraventricular'

  • 'subarachnoid'

  • 'subdural'

  • 'any'

The modelling strategy team NoBrainer used:

  1. Train an image classifier using 5-fold validation

  2. Run inference on all images and extract the global average pool (GAP) activations from the model to use as embeddings

  3. Train an LSTM by running these embeddings in sequence (patient by patient) through it

  4. Make predictions using the LSTM

This post will cover (1) and (2) above while the final post will cover the remaining modelling stages.

Train/Validation Split

The team used 5-fold cross validation, with folds split by PatientID. Splitting by patient was important: a patient could have between 25 and 60 images in the dataset, so if the data was not split by PatientID there was a chance that scans from the same patient could appear in both the train and validation sets. See the code here for how the folds were constructed.
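As an illustration (not the team's exact code), here is a minimal sketch of how such a patient-level split can be built with scikit-learn's GroupKFold; the dataframe and column names are assumptions:

import pandas as pd
from sklearn.model_selection import GroupKFold

def add_fold_column(df: pd.DataFrame, n_folds: int = 5) -> pd.DataFrame:
    '''Assign each image a fold such that no PatientID spans two folds.'''
    gkf = GroupKFold(n_splits=n_folds)
    df = df.copy()
    df["fold"] = -1
    # grouping on PatientID guarantees a patient never appears in more than one fold
    for fold, (_, valid_idx) in enumerate(gkf.split(df, groups=df["PatientID"])):
        df.loc[df.index[valid_idx], "fold"] = fold
    return df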

My dataset split

For faster experimentation I wanted a dataset of about 100K images, so I created my own train/validation split with a 50/50 positive/negative balance; a rough code sketch of the procedure follows the list below.

  • Select the positives for the train and val sets:
    • Find all positive patients for each label (label == 1)
    • Randomly sample the desired number of positive patients (each patient had an average of ~40 images)
    • Split these patients 85%/15% between the train and val sets
    • Select all images belonging to these patients for the train and validation sets
    • Remove any PatientIDs that appear in both the train and val sets from the val set; overlaps can creep in because some patients have multiple different haemorrhage types
    • Remove any negative images from both sets. These are picked up when all of a patient's images are added: because the images form a sequence, the first and last slices of a scan are often negative
  • Select the negatives for the train and val sets:
    • Randomly select negative patients (patients with no label == 1)
    • Split these patients 85/15 between train and val
    • Select images from these patients, matching the image count in the positive set
    • Merge the negatives dataframe with the positives dataframe
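Here is a rough sketch of that procedure. It is illustrative only: the column names, sample sizes and helper names are assumptions, not the exact code from my notebooks.

import numpy as np
import pandas as pd

def make_small_split(df, n_patients=1250, train_frac=0.85, seed=42):
    '''Build a small, patient-level, ~50/50 positive/negative train/val split.
    Assumes one row per image with a "PatientID" column and an "any" label column.'''
    rng = np.random.default_rng(seed)

    pos_patients = df.loc[df["any"] == 1, "PatientID"].unique()
    neg_patients = np.setdiff1d(df["PatientID"].unique(), pos_patients)

    def sample_and_split(patients, n):
        sample = rng.choice(patients, n, replace=False)
        cut = int(train_frac * n)
        return sample[:cut], sample[cut:]

    pos_train, pos_valid = sample_and_split(pos_patients, n_patients)
    neg_train, neg_valid = sample_and_split(neg_patients, n_patients)

    def images_for(patients, positive):
        sub = df[df["PatientID"].isin(patients)]
        # positive patients' scans also contain negative slices (often the first
        # and last images of the sequence), so filter on the "any" label
        return sub[sub["any"] == 1] if positive else sub[sub["any"] == 0]

    train_df = pd.concat([images_for(pos_train, True), images_for(neg_train, False)])
    valid_df = pd.concat([images_for(pos_valid, True), images_for(neg_valid, False)])
    return train_df, valid_df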

Image Cropping - AutoCrop

Cropping the images should help the model by making sure it focuses only on the most important region of the image. The autocrop function cuts "any black space back to edges of where non-black space begins; although keep the square aspect ratio.”

Before and after using the autocrop function

import numpy as np

def autocrop(image, threshold=0):
    '''Crops any edges below or equal to threshold.
    Returns a square, cropped image.
    https://stackoverflow.com/questions/13538748/crop-black-edges-with-opencv
    '''
    if len(image.shape) == 3:
        # Collapse the channel axis with a max so we work on a single 2D image
        flatImage = np.max(image, 2)
    else:
        flatImage = image

    # EXPLAINER:
    # - np.max(flatImage, 0) reduces over axis 0, giving the max value in each column;
    #   np.max(flatImage, 1) does the same per row.
    # - np.where(... > threshold) keeps the indices whose max is above the threshold
    #   (zero here), i.e. the columns/rows that contain something other than black.
    # (Note the variable names are swapped relative to convention: `rows` holds column
    #  indices and `cols` holds row indices, but the slicing below uses them
    #  consistently, so the crop is correct.)
    rows = np.where(np.max(flatImage, 0) > threshold)[0]
    cols = np.where(np.max(flatImage, 1) > threshold)[0]

    # Crop the image to the first and last non-black rows and columns
    # e.g. image[79 : 346 + 1, 43 : 443]
    image = image[cols[0]: cols[-1] + 1, rows[0]: rows[-1] + 1]   # e.g. image.shape now = (357, 399, 3)

    sqside = max(image.shape)   # e.g. 399
    # Create a new square, black, 3-channel image, e.g. (399, 399, 3)
    imageout = np.zeros((sqside, sqside, 3), dtype='uint8')
    # Copy the cropped pixels into the top-left corner of the square image
    imageout[:image.shape[0], :image.shape[1], :] = image.copy()   # imageout.shape = (399, 399, 3)
    return imageout

Alternative cropping method

An alternative, more thorough cropping technique that I would highly recommend is the method used in the fastai v2 library, demonstrated in Jeremy Howard's “Cleaning the data for Rapid Prototyping” notebook. As the example below shows, it can do more than just crop black edges, handling non-head artefacts in the image much more precisely.

Example of the fastai v2 masking approach from the “Cleaning the data for Rapid Prototyping” notebook

Bonus Cropping - autocropmin

autocropmin is a function that the 2nd place team coded up but, in the heat of competition, forgot to add to their training script. It should be better at removing artefacts around images, such as the curved lines from the scanning machine seen above. Even though it performs better than autocrop, I would still recommend checking out what the fastai v2 library can do for masking instead of using this. See my Jupyter notebook for the autocropmin code.

Example output from autocropmin

Data Augmentation

Core to any successful model training is a strong data augmentation policy. The team used the following transforms from the Albumentations library:

  • HorizontalFlip(p=0.5)

  • ShiftScaleRotate(shift_limit=0.05, scale_limit=0.05, rotate_limit=20, p=0.3, border_mode=cv2.BORDER_REPLICATE)

  • Transpose(p=0.5)

  • Normalize(mean=mean_img, std=std_img, max_pixel_value=255.0, p=1.0)
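For reference, composing these into an Albumentations pipeline looks roughly like this; the mean_img and std_img values below are placeholders, not the dataset statistics the team computed:

import cv2
import albumentations as A

mean_img = (0.456, 0.456, 0.456)   # placeholder channel means
std_img = (0.225, 0.225, 0.225)    # placeholder channel stds

transform_train = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.05, rotate_limit=20,
                       p=0.3, border_mode=cv2.BORDER_REPLICATE),
    A.Transpose(p=0.5),
    A.Normalize(mean=mean_img, std=std_img, max_pixel_value=255.0, p=1.0),
])

# usage: augmented_image = transform_train(image=image)["image"]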


Custom Dataset Class

The custom dataset class created performs the following:

  • Appends the .jpg suffix to the image name

  • Reads the image data into an array using OpenCV

  • Crops the image using autocrop (if crop flag is set to true)

  • Resizes the image to the specified size, (480,480) in this case

  • Transforms the image according to the transforms given

  • Adds labels to the output dataset

It returns a dictionary with the image array and, when the labels flag is true, the labels for that image.
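A minimal sketch of such a dataset class, assuming a dataframe with an "Image" filename column and the six label columns (the names here are assumptions, not the exact ones from the original repo):

import cv2
import numpy as np
import torch
from torch.utils.data import Dataset

class IntracranialDataset(Dataset):
    label_cols = ['epidural', 'intraparenchymal', 'intraventricular',
                  'subarachnoid', 'subdural', 'any']

    def __init__(self, df, img_dir, transform=None, labels=True,
                 crop=True, img_size=(480, 480)):
        self.df, self.img_dir = df, img_dir
        self.transform, self.labels = transform, labels
        self.crop, self.img_size = crop, img_size

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img = cv2.imread(f"{self.img_dir}/{row['Image']}.jpg")    # append the .jpg suffix
        if self.crop:
            img = autocrop(img)                                    # autocrop() defined earlier
        img = cv2.resize(img, self.img_size)
        if self.transform is not None:
            img = self.transform(image=img)["image"]               # Albumentations transforms
        img = torch.tensor(img, dtype=torch.float32).permute(2, 0, 1)   # HWC -> CHW
        sample = {"image": img}
        if self.labels:
            sample["labels"] = torch.tensor(
                row[self.label_cols].values.astype(np.float32))
        return sample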


Model Training Parameters

  • Model : ResNeXt-101 32x8d

  • Image size : (480,480)

  • Batch size : 18

  • Epochs : 5

  • lr : 2e-5

  • Folds : 5 (used 5-fold validation, but only trained on 3 of the folds)

  • Optimizer : Adam

  • Loss : Binary Cross Entropy (BCEWithLogitsLoss)

Mixed precision training was also used via the “Apex” library from Nvidia.
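Putting those parameters together, the training setup would look roughly like the sketch below. The train_loader is assumed to be a DataLoader built from the dataset class above, and the Apex opt_level is my guess rather than something the team stated:

import torch
from apex import amp   # Nvidia Apex for mixed precision training

model = model.cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
loss_fn = torch.nn.BCEWithLogitsLoss()

# enable mixed precision; "O1" is a typical choice, not necessarily the team's
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for epoch in range(5):
    for batch in train_loader:
        inputs, targets = batch["image"].cuda(), batch["labels"].cuda()
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()   # scaled backward pass for fp16 stability
        optimizer.step()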


Image Classifier - ResNeXt-101 32x8d

Let's have a look at the ResNeXt-101 32x8d model that the 2nd place solution used.

What is a ResNeXt Model?

ResNeXt models were introduced in “Aggregated Residual Transformations for Deep Neural Networks” in 2016. A ResNeXt model essentially adds an additional dimension (the "next" dimension) to a traditional ResNet block:

A ResNet block compared with a ResNeXt block with cardinality 32, from the ResNeXt paper

This new dimension is called the “cardinality” dimension. ResNeXt came 2nd in the 2016 ILSVRC classification task when it was first introduced. From the authors:

"In this paper, we present a simple architecture which adopts VGG/ResNets’ strategy of repeating layers, while exploiting the split-transform-merge strategy in an easy, extensible way. A module in our network performs a set of transformations, each on a low-dimensional embedding, whose outputs are aggregated by summation. We pursuit (sic) a simple realisation of this idea — the transformations to be aggregated are all of the same topology (e.g., Fig. 1 (right)). This design allows us to extend to any large number of transformations without specialised designs."

The authors then simplified this “split-transform-merge” strategy to allow a faster, easier implementation. All three forms below are equivalent; (c) is what the authors chose to implement:

The authors chose to implement (c) because it is more succinct and faster than the other two forms.
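To make the cardinality idea concrete, here is a rough PyTorch sketch of a ResNeXt bottleneck block in form (c), where the parallel paths are expressed as a single grouped convolution. This is purely illustrative, not the torchvision/WSL implementation:

import torch.nn as nn

class ResNeXtBottleneck(nn.Module):
    def __init__(self, in_ch=256, bottleneck_ch=128, cardinality=32):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, bottleneck_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(bottleneck_ch), nn.ReLU(inplace=True),
            # the grouped 3x3 conv is the "split-transform" step over `cardinality` paths
            nn.Conv2d(bottleneck_ch, bottleneck_ch, kernel_size=3, padding=1,
                      groups=cardinality, bias=False),
            nn.BatchNorm2d(bottleneck_ch), nn.ReLU(inplace=True),
            # the final 1x1 conv aggregates the group outputs ("merge by summation")
            nn.Conv2d(bottleneck_ch, in_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(in_ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.block(x))   # residual connection, then ReLU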

The ResNeXt-101 32x8d WSL Model Used

TL;DR: These "WSL" models were trained by Facebook AI Research on 940 million Instagram images in a weakly supervised fashion before being fine-tuned on ImageNet.

Naming Convention

This model has 101 layers, a cardinality of 32 (the number of parallel groups) and a bottleneck width of 8 (the "8d"). See the paper above for a more in-depth explanation.

Other Model Options?

Hindsight is 20/20, and potentially there were other image classification architectures that could have worked even better here. One (imperfect) way to assess potential model performance is ImageNet top-1 accuracy.

ResNeXt-101 32x8d top-1 ImageNet accuracy

  • The ResNeXt-101 32x8d model used here has a top-1 accuracy of 82.2% with 88M parameters

  • Its bigger brother, the ResNeXt-101 32x16d model, reaches 84.2% but has 193M parameters. That is a real beast, and it wouldn't have been practical to use on a dataset this large given the limited timeline.

EfficientNet-b7

For comparison:

  • the EfficientNet-b4 model has a top-1 accuracy of 83.0% with only 19M parameters, 0.8% higher than ResNeXt-101 32x8d

  • the EfficientNet-b7 model reaches 84.5% with 66M parameters, 2.3% higher while still having fewer parameters

Potentially either of these could have been viable candidates. The 2nd place winners did state they trialled the EfficientNet-b0 model, but most likely such a small model just wouldn't have enough capacity to make the most of this large RSNA dataset.

Assemble-ResNet-50

Mentioning this model is a little facetious as it hadn't been released while the competition was ongoing, but if it had been, it seems like a great candidate for at least some rapid experimentation. It is a modified ResNet-50 trained with a number of recent training techniques and data augmentations.

It beats the ResNeXt-101 32x8d top-1 ImageNet result by 0.6%, at 82.8%, yet is a much smaller model given that it is based on a ResNet-50.

Having said all that, the only way of knowing would be to actually test all of this :)

Loss Function

Binary Cross Entropy loss was used here. I won't go into depth on it (this article here has a great explainer if you'd like to know what it is), but I will mention that a Kaggler found out during the competition that, of the 6 classes to predict, the “any” class had twice the weight of the other 5, so a custom loss function was needed. Since loss_all below already averages over all 6 classes, adding the “any” loss once more and dividing by 7 gives “any” a relative weight of 2 and each of the other classes a weight of 1:

def criterion(data, targets, criterion=torch.nn.BCEWithLogitsLoss()):
    ''' Custom loss function: weighted BCE with double weight on the 'any' column '''
    loss_all = criterion(data, targets)
    loss_any = criterion(data[:, -1:], targets[:, -1:])
    return (loss_all * 6 + loss_any * 1) / 7
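A quick sanity check of the weighting, assuming (as in the rest of the code) that the “any” label sits in the last column:

import torch

logits = torch.randn(4, 6)                      # dummy batch: 4 predictions over the 6 classes
targets = torch.randint(0, 2, (4, 6)).float()   # dummy binary targets
loss = criterion(logits, targets)               # scalar; the "any" (last) column is counted twice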

Extracting the Embeddings

Once the model was trained to a reasonable validation loss, the embeddings were generated by running inference on every image in the dataset. An “Identity” module replaces the model's fully connected head so that the final-layer activations (the 2048-dimensional GAP features) can be extracted easily.

import torch.nn as nn

class Identity(nn.Module):
    def __init__(self):
        super(Identity, self).__init__()

    def forward(self, x):
        return x

The Identity layer is then set as the final fully connected layer. (In the training script the model is wrapped in nn.DataParallel, which is why the head is addressed as model.module.fc.)

model = torch.hub.load('facebookresearch/WSL-Images', 'resnext101_32x8d_wsl')
model.fc = torch.nn.Linear(2048, n_classes)     # 6-class head used during training
model = torch.nn.DataParallel(model)            # multi-GPU wrapper; the head now lives at model.module.fc
model.module.fc = Identity()                    # swap the head for Identity to expose the 2048-d GAP activations

At its core, the extraction simply comes down to saving the output of the final layer of the model:

embeddings_list = []
out = model(inputs)
embeddings_list.append(out.detach().cpu().numpy())
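In full, the extraction loop might look something like this sketch (the loader follows the dataset class above; the names are assumptions):

import numpy as np
import torch

model.eval()
all_embeddings = []
with torch.no_grad():
    for batch in loader:                          # iterate the whole dataset in a fixed order
        out = model(batch["image"].cuda())        # (batch_size, 2048) GAP activations
        all_embeddings.append(out.cpu().numpy())
embeddings = np.concatenate(all_embeddings)       # later grouped patient-by-patient for the LSTM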

Next Up…

In this post we covered the key components of the image classifier part of the 2nd place solution. If you'd like to run the above, check out the Jupyter Notebooks accompanying this series here. Next up we'll have a look at the second part: running the generated embeddings through an LSTM model!

Paper Citations

Assembled-CNN

Jungkyu Lee, Taeryun Won, and Kiho Hong. Compounding the Performance Improvements of Assembled Techniques in a Convolutional Neural Network. 2020. https://arxiv.org/abs/2001.06268

EfficientNet

Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. ICML 2019. https://arxiv.org/abs/1905.11946

ResNeXt

Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated Residual Transformations for Deep Neural Networks. 2016. https://arxiv.org/pdf/1611.05431.pdf

ResNeXt WSL models

Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the Limits of Weakly Supervised Pretraining. 2018. https://arxiv.org/pdf/1805.00932.pdf
