After performing dismally in the Kaggle RSNA Intracranial Haemorrhage Competition, thanks to a pig-headed strategy and too little thinking, I resolved to see what the winners had done right. This series of posts will cover what I learned looking at the code shared by the 2nd placed team, whose solution I found both approachable and innovative. I hope you enjoy it.
Part 4 of this series will give an overview of the second stage of model development: training the LSTM model.
Here is the original code from the 2nd place solution covering the LSTM training that I will walk through in this post
Here are my Jupyter Notebooks to accompany these posts
If you need a recap of the first stage of model training, part 3 of the series covering the image classifier training is here
Main Observations
I really enjoyed taking a look under the hood of this LSTM training script. The main thing that stood out for me was the need **to be very careful when handling the data with the custom Dataset and Collate Function**. Making sure to keep the correct sequence of images when processing the data and feeding it to the model was the goal here.
The idea to show the model the difference between the current embedding and the previous/next in the sequence was a really nice touch. Also handling batches whose elements contained different sequence lengths was an issue that was nicely dealt with. The LSTM model is surprisingly easy to understand, yet clearly very effective.
Dataset Prep
I think one key to the 2nd place winner’s success was their uber careful treatment of the data. This came down to three things: ordering and keeping track of the sequences of embeddings, using the delta between the current and the previous/next embeddings, and handling cases where a batch contains sequences of different lengths.
Ordering and Keeping Track of Sequences
1. Create a unique Series Identifier - SliceID
Concatenating [PatientID, SeriesInstanceUID, StudyInstanceUID] gives a unique sequence identifier, but it still says nothing about the order within the sequence.
# Join the three identifiers into one string per row (the '__' separator is arbitrary)
trnmdf['SliceID'] = trnmdf[['PatientID', 'SeriesInstanceUID',
    'StudyInstanceUID']].apply(lambda x: '{}__{}__{}'.format(*x.tolist()), 1)
2. Get the Sequence Order of the Embeddings
`ImagePositionPatient` from the DICOM metadata specifies the X, Y and Z position of the top left corner of the image. See DICOM documentation here for a fuller explanation. Sorting the dataset by SliceID and then by the X, Y and Z coordinates puts the rows of our dataframe in the correct order within each sequence.
# Generate poscols like this: ['ImagePos1', 'ImagePos2', 'ImagePos3']
poscols = ['ImagePos{}'.format(i) for i in range(1, 4)]
trnmdf[poscols] = pd.DataFrame(trnmdf['ImagePositionPatient'].apply(
lambda x: list(map(float, ast.literal_eval(x)))).tolist())
Sort dataframe by SliceID and the X,Y and Z coordinates, select a subset of the columns then reset the index
trnmdf = trnmdf.sort_values(['SliceID'] + poscols)[
['PatientID', 'SliceID', 'SOPInstanceUID'] + poscols
].reset_index(drop=True)
Now we have a dataframe ordered by SliceID and sequence element
3. Index the Embeddings in Each Sequence - seq
Group by the SliceID col and then do a cumulative count for each item that it is grouped by. This will give a count from 1 to the length of the specific sequence, then reset again once the next sequence starts
trnmdf['seq'] = (trnmdf.groupby(['SliceID']).cumcount() + 1)
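To see the three steps above working together, here is a minimal sketch on a toy dataframe. The rows, the UIDs and the `meta` variable name are made up for illustration, and I am assuming `ImagePositionPatient` is stored as a string holding a list of numbers, as the `ast.literal_eval` call in the snippets above suggests.

```python
import ast

import pandas as pd

# Toy metadata (hypothetical values): two series, rows deliberately out of order.
meta = pd.DataFrame({
    "PatientID": ["p1", "p1", "p2", "p1", "p2"],
    "SeriesInstanceUID": ["s1", "s1", "s9", "s1", "s9"],
    "StudyInstanceUID": ["t1", "t1", "t7", "t1", "t7"],
    "SOPInstanceUID": ["img_c", "img_a", "img_e", "img_b", "img_d"],
    "ImagePositionPatient": [
        "['-125', '-120', '30.0']",
        "['-125', '-120', '20.0']",
        "['-110', '-100', '55.0']",
        "['-125', '-120', '25.0']",
        "['-110', '-100', '50.0']",
    ],
})

# 1. Unique identifier per sequence.
id_cols = ["PatientID", "SeriesInstanceUID", "StudyInstanceUID"]
meta["SliceID"] = meta[id_cols].apply(lambda row: "__".join(row), axis=1)

# 2. Parse the position string into three float columns.
pos_cols = ["ImagePos{}".format(i) for i in range(1, 4)]
positions = meta["ImagePositionPatient"].apply(
    lambda s: [float(v) for v in ast.literal_eval(s)]
)
pos_df = pd.DataFrame(positions.tolist(), columns=pos_cols, index=meta.index)
meta = pd.concat([meta, pos_df], axis=1)

# Sort by sequence, then by position within the sequence.
meta = meta.sort_values(["SliceID"] + pos_cols).reset_index(drop=True)

# 3. Running count within each sequence, starting at 1.
meta["seq"] = meta.groupby("SliceID").cumcount() + 1

print(meta[["SliceID", "SOPInstanceUID", "ImagePos3", "seq"]])
```

After sorting, the three `p1` images come out in order of increasing Z position with `seq` 1 to 3, and the count restarts at 1 for the `p2` series.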
Embedding DELTA COMPARISON
The delta between the current and previous/next embedding is computed in the custom IntracranialDataset class. `patemb` is the sequence of embeddings: if a sequence has 36 images, `patemb` will have shape (36, 2048). Concatenating the current embedding with the two delta embeddings triples the embedding size from 2048 to 6144, and this is the embedding that is fed into the model.
# Arrays of zeros with the same shape as patemb
patdeltalag = np.zeros(patemb.shape)
patdeltalead = np.zeros(patemb.shape)
# Difference between current and previous image in sequence (first row stays zero)
patdeltalag[1:] = patemb[1:] - patemb[:-1]
# Difference between current and next image in sequence (last row stays zero)
patdeltalead[:-1] = patemb[:-1] - patemb[1:]
# The 3 arrays are concatenated along the last axis, going from 3 x (36, 2048) to (36, 6144)
patemb = np.concatenate((patemb, patdeltalag, patdeltalead), -1)
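A tiny worked example makes the lag/lead deltas easier to check by eye. This is a sketch only: `add_delta_features` is a hypothetical helper name of mine wrapping the same four operations, and the 3 x 2 toy array stands in for a real (36, 2048) sequence.

```python
import numpy as np


def add_delta_features(patemb):
    """Concatenate each embedding with its lag and lead deltas."""
    patdeltalag = np.zeros_like(patemb)
    patdeltalead = np.zeros_like(patemb)
    # Current minus previous; the first row has no previous, so it stays zero.
    patdeltalag[1:] = patemb[1:] - patemb[:-1]
    # Current minus next; the last row has no next, so it stays zero.
    patdeltalead[:-1] = patemb[:-1] - patemb[1:]
    return np.concatenate((patemb, patdeltalag, patdeltalead), axis=-1)


# Toy sequence of 3 "images" with 2-dimensional embeddings.
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [4.0, 40.0]])
out = add_delta_features(toy)
print(out.shape)
print(out)
```

The middle row ends up as its own embedding `[2, 20]`, followed by the lag delta `[1, 10]` and the lead delta `[-2, -20]`, so the feature width triples from 2 to 6 just as 2048 becomes 6144.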
TRAIN/VAL SPLIT
The train/validation split here is dictated by the train/val split made in the image classifier stage of modelling. Because the split in that stage was done on PatientID, entire sequences of image embeddings are kept in the same train/val group, so we shouldn’t have any issues in the LSTM stage with sequences being split across the train/val datasets.
You can still create a new train/val split if you like, but then comparing your validation results between the ResNeXt model and the LSTM model wouldn’t make sense, as each validation metric would be calculated on a different dataset. That makes it harder to tell whether your LSTM is actually improving on the ResNeXt model at all.
LOAD EMBEDDINGS
Loading the saved embeddings is straightforward. One comment here is that they used the `.npz` format to save the embedding arrays, which I hadn’t seen before. NPZ is NumPy's archive format: a zip file containing one `.npy` file per array, which is compressed when written with `np.savez_compressed`. See the NumPy docs for how to use it.
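If, like me, you haven't used `.npz` before, a round trip looks roughly like this. The file name and array are placeholders of mine, not the team's; the only fixed detail is that NumPy stores positional arrays under the keys `arr_0`, `arr_1`, and so on.

```python
import os
import tempfile

import numpy as np

# Stand-in for a block of saved image embeddings (5 images x 2048 features).
emb = np.random.rand(5, 2048).astype(np.float32)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "emb_example.npz")  # hypothetical file name

    # Arrays passed positionally are stored under "arr_0", "arr_1", ...
    np.savez_compressed(path, emb)

    # np.load returns a lazy, dict-like NpzFile; index it by key to read an array.
    with np.load(path) as data:
        loaded = data["arr_0"]

print(loaded.shape, loaded.dtype)
```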
DATALOADER
THE Collate Function
`collate_fn` is an argument to the PyTorch DataLoader. From the PyTorch docs, its default behaviour is to merge “a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset”. This argument can also take a custom function that performs additional manipulations on your batch. This is what the team used to deal with sequences of varying lengths, and indeed the docs mention one use case as “padding sequences of various lengths”.
1. Dummy Mask
A stack of dummy arrays of zeros is created as needed for each sequence in the batch. This is our “mask”. As we iterate through the batch, the length of this mask (= number of dummy arrays) is the difference between the length of the longest sequence in the batch and the length of the current sequence. For example, if you have a batch of 4 sequences with lengths [34, 40, 37, 40], the mask lengths created for the sequences will be [6, 0, 3, 0].
# "b" below is a single item (sequence) from the batch. "maxlen" is the length of the largest item in the batch
masklen = maxlen-len(b['emb'])
# Stack a number ("masklen") of dummy embeddings onto the current sequence of embeddings
# to make sure all sequences in the batch are the same length
b['emb'] = np.vstack((np.zeros((masklen, embdim)), b['emb']))
2. Keeping track
The mask is prepended to the sequence, so we need to keep track of this in our indices.
An index of -1 is prepended to the embedding index (“embidx”) for every dummy array, so that the dummy arrays always come first when sorting by this index, e.g. our embidx might now look like tensor([-1, -1, -1, -1, 4567, 12834, 977, …])
b['embidx'] = torch.cat((torch.ones((masklen),dtype=torch.long)*-1, b['embidx']))
The "mask" array is an array of flags indicating whether each embedding is real (1) or a dummy (0), e.g. sticking with the previous example: array([0., 0., 0., 0., 1., 1., 1., …]).
b['mask'] = np.ones((maxlen))
b['mask'][:masklen] = 0. # set the first "masklen" number of values to zero to indicate they are dummy arrays
If there are labels in your dataloader (will be true for your Train and Validation loaders), then you also need to pad your labels array to account for the mask.
if withlabel:
b['labels'] = np.vstack((np.zeros((maxlen-len(b['labels']), labdim)), b['labels']))
You can check out my Jupyter notebook repo for the full collate_fn code.
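Putting the padding steps above together, here is a simplified sketch of the idea. It is not the team's `collate_fn`: I've used NumPy only (no tensors) to keep it self-contained, `pad_batch` is a hypothetical name, and the toy embedding and label dimensions are arbitrary.

```python
import numpy as np


def pad_batch(batch, embdim, labdim):
    """Left-pad every sequence in `batch` to the length of the longest one."""
    maxlen = max(len(b["emb"]) for b in batch)
    out = {"emb": [], "embidx": [], "mask": [], "labels": []}
    for b in batch:
        masklen = maxlen - len(b["emb"])
        # Dummy embeddings and labels go at the START of the sequence.
        out["emb"].append(np.vstack((np.zeros((masklen, embdim)), b["emb"])))
        out["labels"].append(np.vstack((np.zeros((masklen, labdim)), b["labels"])))
        # Dummy rows get index -1 so they sort before every real index.
        pad_idx = np.full(masklen, -1, dtype=np.int64)
        out["embidx"].append(np.concatenate((pad_idx, b["embidx"])))
        # Flag array: 0 for dummy rows, 1 for real rows.
        mask = np.ones(maxlen)
        mask[:masklen] = 0.0
        out["mask"].append(mask)
    # Every item now has length maxlen, so they stack into one array per key.
    return {key: np.stack(values) for key, values in out.items()}


# Toy batch with the sequence lengths from the example above.
rng = np.random.default_rng(0)
lengths = [34, 40, 37, 40]
batch, start = [], 0
for n in lengths:
    batch.append({
        "emb": rng.random((n, 8)),
        "embidx": np.arange(start, start + n),
        "labels": rng.integers(0, 2, size=(n, 6)).astype(float),
    })
    start += n

padded = pad_batch(batch, embdim=8, labdim=6)
print(padded["emb"].shape)                       # one (maxlen, embdim) block per sequence
print((padded["mask"] == 0).sum(axis=1))         # number of dummy rows per sequence
```

Counting the zero flags per sequence recovers the mask lengths [6, 0, 3, 0] from the example.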
MODELLING PARAMETERS
Modelling parameters used in the team’s solution were fairly straightforward:
Model : Custom LSTM, from stage 1 Kaggle Toxic comp
Epochs : 10 per fold
Learning rate: 1e-4, with 0.95 stepped decay
Folds : 2 (my notebook series doesn't implement folds; the 2nd place team wanted to train more folds, but only had time for 2 before the competition ended)
Optimizer : Adam
Batch Size : 4
Mixed Precision used (via Nvidia’s apex library)
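To get a feel for what a 0.95 stepped decay does to the learning rate over 10 epochs, here is a quick sketch. I am assuming the decay is applied once per epoch, starting from the base rate; the list above doesn't say how often the step happens, so treat the numbers as illustrative.

```python
base_lr = 1e-4   # starting learning rate from the list above
decay = 0.95     # multiplicative decay factor
epochs = 10      # epochs per fold

# Assumption: one decay step per epoch, so epoch 0 trains at the base rate.
schedule = [base_lr * decay ** epoch for epoch in range(epochs)]

for epoch, lr in enumerate(schedule):
    print("epoch {}: lr = {:.3e}".format(epoch, lr))
```

Under that assumption the learning rate only falls to around 63% of its starting value by the final epoch, so this is a fairly gentle schedule.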
LSTM
From the team’s description:
LSTM architecture lifted from the winners of first stage toxic competition. This is a beast - only improvements came from making the hidden layers larger. Oh, we added on the embeddings to the LSTM output and this helped a bit also
From the team’s repo, this is the entire model overview, including the LSTM architecture on the right:
Another way of visualising the LSTM stage would be:
Hopefully at least 1 of these visualisations makes sense to you! Note that in the original code, there was also mention of “SpatialDropout”, however it wasn’t used in the final solution.
Prediction
Finally, one small thing to note is that when calculating the validation loss and test predictions, the mask embeddings are stripped out of the sequences. See the `prediction()` function in the code for details.
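The stripping step can be pictured as a boolean filter over the padded batch. This is only a sketch of the idea in NumPy, not the team's `prediction()` code; `strip_padding` and the toy shapes are my own placeholders.

```python
import numpy as np


def strip_padding(values, mask):
    """Keep only the rows of `values` whose mask flag is 1 (real embeddings)."""
    return values[mask == 1]


# Toy batch: 2 sequences padded to length 4, with 3 outputs per image.
preds = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)
mask = np.array([[0.0, 0.0, 1.0, 1.0],    # first sequence: 2 dummy rows, 2 real
                 [1.0, 1.0, 1.0, 1.0]])   # second sequence: all real

real_preds = strip_padding(preds, mask)
print(real_preds.shape)
```

Indexing with the 2D mask flattens the batch and sequence axes, leaving one row per real image, which is the shape you want when scoring predictions image by image.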
That's a Wrap!
Thanks for reading this far. I hope this breakdown will help you in your own machine learning journey, be it on Kaggle or elsewhere.
If you have any feedback for me about this series, or the blog in general I’d love to hear from you on Twitter: @mcgenergy