Below is a summary of the Top 5 winning solutions from Kaggle's Google QUEST competition. As this was my first NLP competition, I came away with a slightly disappointing result but also with a lot of learnings for the next one!
Competition Overview
In short, the goal of the competition was to predict 30 scores (each between 0 and 1) that human raters had given to different aspects of questions and answers from Stack Exchange sites.
Competition learnings
entered too late (only 2 weeks before the deadline) given my lack of NLP experience
chose to be a contrarian and try for LSTM glory instead of starting with transformers
chose to use fastai v2 over v1, meaning too much time spent learning v2 and not enough time on modelling
spent too much time on text preprocessing
The upside is that at least I know the innards of LSTMs and fastai v2 much much better :D
Summary of Solutions
First of all, thanks to all the winners who shared their solutions. I'd strongly encourage you to have a look at all of them here and give them an upvote; there is some really elegant work in there. Also, @oohara beat me by 5 hours with their summary of solutions post here, which I'd recommend checking out too as it summarises solutions beyond the top 5.
Each of the solution posts is linked in its respective title. I've also added a 🌟 beside ideas that I really liked or hadn't heard about before; completely subjective and more for my future reference. Ok, enough waffle, let's go!
1st - Ay Caramba!
Linear blend of BERTs + 🌟BART
Pretrained LM + Pseudo Labelling + Post Processing (binning)
Pre-trained language models on Stack Exchange data; in addition to the common MLM task, the model also predicted 6 auxiliary targets via a linear layer on top of the pooled LM output.
Training:
GroupKFold with question_title groups
🌟Architecture with Multi-Sample Dropout for Accelerated Training and Better Generalization
Differential learning rates for encoder and head
🌟Using the CLS outputs from all BERT layers rather than only the last one, via a weighted sum of these outputs where the weights were learnable and constrained to be positive and sum to 1 (both this and the multi-sample dropout idea are sketched below)
Their solution post also has a wise tip about the pitfalls of pseudo-labelling with data very similar to the train set, which can lead to overfitting
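To make the two starred ideas above a bit more concrete, here's a minimal PyTorch sketch of a classification head that takes a learnable, softmax-normalised weighted sum of the CLS output from every encoder layer and then applies multi-sample dropout, averaging the logits over several dropout masks. The class name, layer count and sizes are mine, not the winners', so treat it as an illustration of the idea rather than their implementation:

```python
import torch
import torch.nn as nn


class WeightedLayerPoolingHead(nn.Module):
    """Hypothetical head: learnable weighted sum of per-layer CLS outputs
    plus multi-sample dropout averaged over several dropout masks."""

    def __init__(self, hidden_size=768, num_layers=12, num_targets=30,
                 dropout=0.3, num_dropout_samples=5):
        super().__init__()
        # One learnable weight per encoder layer; softmax keeps the weights
        # positive and summing to 1, as described in the solution post.
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.dropouts = nn.ModuleList(
            [nn.Dropout(dropout) for _ in range(num_dropout_samples)]
        )
        self.classifier = nn.Linear(hidden_size, num_targets)

    def forward(self, hidden_states):
        # hidden_states: one (batch, seq_len, hidden) tensor per encoder layer,
        # e.g. from a transformers model called with output_hidden_states=True
        # (assumed here to exclude the embedding layer output).
        cls_per_layer = torch.stack([h[:, 0] for h in hidden_states], dim=0)
        weights = torch.softmax(self.layer_weights, dim=0)
        pooled = (weights[:, None, None] * cls_per_layer).sum(dim=0)

        # Multi-sample dropout: run several dropout masks through the same
        # classifier and average the resulting logits.
        logits = torch.stack(
            [self.classifier(drop(pooled)) for drop in self.dropouts], dim=0
        ).mean(dim=0)
        return logits
```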
2nd - Dual Transformer
Models:
One transformer for the question text, one for the answer text
2 x dual Roberta-base
🌟 dual Roberta-large (2 x 256 tokens; it turns out halving the number of tokens didn't lose much performance for Roberta-large)
dual XLNet-base
🌟 Siamese Roberta-large with weighted averaged layers (similar to 1st place solution)
See their post for a really nice explanation and visualisation of the architectures
Found differential learning rates + a warmup schedule were key to the dual-model architecture; see their post for a description of their model heads (average -> concat -> 30 x 2 linear layers)
Same input text for all models, but the targets were changed to an ordinal representation (giving 170 target columns)
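I don't know exactly how they constructed the ordinal representation, but one common way to get something like it is a cumulative ("thermometer") encoding, where each target with k distinct observed levels becomes k-1 binary "score >= level" columns. A rough sketch of that idea (my assumption, not necessarily their scheme):

```python
import numpy as np
import pandas as pd


def ordinal_encode(targets: pd.DataFrame) -> pd.DataFrame:
    """Cumulative ('thermometer') encoding: for every target column, each
    distinct value above the minimum becomes a binary 'column >= value'
    indicator. One plausible way to expand 30 targets into ~170 columns."""
    encoded = {}
    for col in targets.columns:
        levels = np.sort(targets[col].unique())
        for level in levels[1:]:  # the '>= minimum' column is always 1, skip it
            encoded[f"{col}_ge_{level:.3f}"] = (targets[col] >= level).astype(np.float32)
    return pd.DataFrame(encoded, index=targets.index)
```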
Training:
Differential learning rates (transformer @ 3e-5, model head(s) @ 5e-3)
3 epochs, cosine scheduling, 1 epoch warmup
AdamW or RAdam with weight decay of 0.01
Effective batch size of 8 using 🌟gradient accumulation (see the sketch after this list)
🌟 Modified 5-fold GroupKFold strategy (see post for more)
Post-processing: threshold clipping
Equally weighted blend of the 5 model outputs
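For reference, here's a minimal PyTorch sketch of two of the training details above: parameter groups with different learning rates for the transformer body and the head(s), and gradient accumulation to reach an effective batch size of 8 from smaller physical batches. model, model.transformer, model.head, loss_fn and train_loader are placeholders I've assumed, not their code:

```python
from torch.optim import AdamW

# Differential learning rates: slow for the pretrained transformer,
# fast for the freshly initialised head(s), with weight decay of 0.01.
optimizer = AdamW(
    [
        {"params": model.transformer.parameters(), "lr": 3e-5},
        {"params": model.head.parameters(), "lr": 5e-3},
    ],
    weight_decay=0.01,
)

# Gradient accumulation: e.g. physical batch size 2 x 4 steps = effective batch size 8.
accumulation_steps = 4
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(train_loader):
    loss = loss_fn(model(inputs), targets) / accumulation_steps
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```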
(In my notes I saved this link here to “Parameter-Efficient Transfer Learning for NLP”, so I’m leaving it here for now until I find a better home for it)
3rd - All the models
Used text columns plus a category column
🌟 Text truncation: i) pre-truncate, ii) post-truncate, iii) head + tail tokens, iv) longer answer
Head + tail tokens method from How to Fine-Tune BERT for Text Classification
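A quick sketch of the head + tail idea from that paper: when the tokenised text exceeds the length budget, keep the first chunk and the last chunk and drop the middle. The default split below is just illustrative, not necessarily what they used:

```python
def head_tail_truncate(token_ids, max_len=512, head_len=128):
    """Keep the first head_len tokens and the last (max_len - head_len) tokens,
    dropping the middle; return the text unchanged if it already fits."""
    if len(token_ids) <= max_len:
        return token_ids
    tail_len = max_len - head_len
    return token_ids[:head_len] + token_ids[-tail_len:]
```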
Models ensemble:
🌟LSTM + Universal Sentence Encoder
2 x BERT-base uncased, 2 x BERT-base cased, 2 x BERT-large uncased, 2 x BERT-large cased
ALBERT-base, Roberta-base, GPT2-base, XLNet-base
Stacking 2 Linear layers without activation performed better than single Linear layer in some cases.
Some of the models had a category embedding structure additionally.
See their post for really nice visualisations of their model architectures
Training:
Min-Max target scaling
Larger weights for minority positive or negative samples
🌟 gelu_new activation for BERT-based models
Cosine warmup scheduler, 5-fold CV
EMA (exponential moving average of weights)
Post-processing: clipped predictions with thresholds decided by golden section search (see their solution post for a code example and also here for more on golden section search); a rough sketch is included at the end of this section
Blending: TPE (Tree-structured Parzen Estimator) optimization using Optuna
Didn't work: manifold mixup, pretraining masked LM, input-response prediction
CODE: LSTM code here, it’s a great kernel!
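In case it's useful, here's a rough sketch of the threshold-clipping idea: for each target column, use a golden section search over out-of-fold predictions to find the clipping threshold that maximises Spearman correlation. This is my reconstruction of the idea, not their code (which is linked in their post):

```python
import numpy as np
from scipy.stats import spearmanr

GOLDEN = (np.sqrt(5) - 1) / 2  # ~0.618


def golden_section_search(score_fn, lo=0.0, hi=1.0, n_iters=30):
    """Maximise score_fn(threshold) over [lo, hi], assuming the score is
    roughly unimodal in the threshold."""
    a, b = lo, hi
    for _ in range(n_iters):
        c = b - GOLDEN * (b - a)
        d = a + GOLDEN * (b - a)
        if score_fn(c) > score_fn(d):
            b = d  # maximum lies in [a, d]
        else:
            a = c  # maximum lies in [c, b]
    return (a + b) / 2


def spearman_after_clipping(preds, labels, threshold):
    # Clipping low predictions up to the threshold creates ties, which can
    # help the Spearman metric when the true labels are heavily tied.
    return spearmanr(np.maximum(preds, threshold), labels).correlation


# Hypothetical usage for one target column of out-of-fold predictions:
# best_t = golden_section_search(
#     lambda t: spearman_after_clipping(oof_preds[:, i], oof_labels[:, i], t))
```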
4th - Triple Transformer
Solution post here, nice architecture overview post here and post-processing post here
Similar to the 2nd place solution but combining 3 transformers instead of 2
A separate BERT + text input to predict each of:
21 question targets
9 answer targets
all 30 targets
Backpropagate through every transformer using 4 losses: loss1(answer targets, output from bert_answer), loss2(question targets, output from bert_question), loss3(all 30 targets, output from bert_3 which combines answer + question), and loss4, the final loss computed after concatenating the logits from all of the transformers (a rough sketch of this is at the end of this section)
Do the same again for XLNet
Flexible length allocation when encoding 2 texts, i.e. give text2 any token budget that text1 doesn't use, e.g.:
if len(text1) < max_len1 and len(text2) > max_len2:
    max_len2 = max_len2 + (max_len1 - len(text1))
MultiStratifiedKFold used
Data augmentation (20%): back-translation English -> Spanish -> English, using TextBlob
Additional info here from the Kaggle Jigsaw Toxic Comment competition
Post processing code here: https://www.kaggle.com/c/google-quest-challenge/discussion/129831
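Here's a very rough PyTorch sketch of how the four losses above could be combined; the module and variable names are placeholders I've made up, BCE-with-logits is my assumption for the loss, and the heads are simplified compared to the actual solution:

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class TripleTransformer(nn.Module):
    """Simplified sketch: one encoder per view (question, answer, question+answer),
    trained with three per-view losses plus a loss on the concatenated logits."""

    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert_question = AutoModel.from_pretrained(model_name)
        self.bert_answer = AutoModel.from_pretrained(model_name)
        self.bert_both = AutoModel.from_pretrained(model_name)
        hidden = self.bert_question.config.hidden_size
        self.head_question = nn.Linear(hidden, 21)    # 21 question targets
        self.head_answer = nn.Linear(hidden, 9)       # 9 answer targets
        self.head_both = nn.Linear(hidden, 30)        # all 30 targets
        self.head_final = nn.Linear(21 + 9 + 30, 30)  # combines all logits

    def forward(self, q_inputs, a_inputs, qa_inputs):
        q = self.head_question(self.bert_question(**q_inputs).last_hidden_state[:, 0])
        a = self.head_answer(self.bert_answer(**a_inputs).last_hidden_state[:, 0])
        qa = self.head_both(self.bert_both(**qa_inputs).last_hidden_state[:, 0])
        final = self.head_final(torch.cat([q, a, qa], dim=1))
        return q, a, qa, final


def total_loss(outputs, q_targets, a_targets, all_targets):
    q, a, qa, final = outputs
    bce = nn.BCEWithLogitsLoss()
    return (bce(q, q_targets) + bce(a, a_targets)
            + bce(qa, all_targets) + bce(final, all_targets))
```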
5th - Dual models
Based on Internet Sentiment Analysis solution: BDCI2019-SENTIMENT-CLASSIFICATION (link in Mandarin)
Split into 2 models, 1 for the question targets and 1 for the answer targets
Roberta-large, Roberta-base, XLNet used
Added additional host, categorical and statistical features (e.g. word count, answer word count vs question word count)
Text cleaning to remove stop words and some symbols
Target output blend: 0.4 x Roberta-large + 0.3 x Roberta-base + 0.3 x XLNet-base
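That blend is just a weighted average of the per-model predictions, e.g. (with hypothetical prediction arrays):

```python
import numpy as np

# Hypothetical out-of-fold / test prediction arrays, each of shape (n_samples, 30)
preds_roberta_large = np.load("preds_roberta_large.npy")
preds_roberta_base = np.load("preds_roberta_base.npy")
preds_xlnet_base = np.load("preds_xlnet_base.npy")

blend = 0.4 * preds_roberta_large + 0.3 * preds_roberta_base + 0.3 * preds_xlnet_base
```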
That's the Top 5 solutions from the Google QUEST competition; see here for the winners' own summaries and here for a summary of solutions beyond the Top 5. Now to start the DeepFakes competition!!
As always, any feedback on this or my other posts is welcome, you can reach me on Twitter: @mcgenergy.