
[Kaggle] Google QUEST NLP - Top 5 Solution Summaries

Below is a summary of the Top 5 winning solutions for Kaggle’s Google QUEST competition. As this was my first NLP competition, I came away from it with a slightly disappointing result, but also with a lot of learnings for the next one!

Competition Overview

In short, the goal of the competition was to predict 30 scores (between 0 and 1) that raters had given to different aspects of questions and answers from StackExchange.

Competition learnings

  • entered too late (only 2 weeks before the deadline) given my lack of NLP experience

  • chose to be a contrarian and try for LSTM glory instead of starting with transformers

  • chose to use fastai v2 over v1, meaning too much time spent learning v2 and not enough time on modelling

  • spent too much time on text preprocessing

The upside is that at least I know the innards of LSTMs and fastai v2 much much better :D

Summary of Solutions

First of all, thanks to all the winners who shared their solutions. I'd strongly encourage you to have a look at all of them here and give them an upvote; there is some really elegant work in there. Also, @oohara beat me by 5 hours with their summary of solutions post here, which I’d recommend checking out too as it covers solutions beyond the top 5.

Each of the solution posts is linked in its respective title. I’ve also added a 🌟 beside ideas that I really liked or hadn’t heard of before; this is completely subjective and more for my future reference. OK, enough waffle, let’s go!

1st - Ay Caramba!

  • Linear blend of BERTs + 🌟BART

  • Pretrained LM + Pseudo Labelling + Post Processing (binning)

  • Pre-trained language models on StackExchange data; in addition to the usual MLM task, the model also predicted 6 auxiliary targets with a linear layer on top of the pooled LM output

  • Training:

    • GroupKFold with question_title groups

    • 🌟Multi-sample dropout in the model head (from the paper “Multi-Sample Dropout for Accelerated Training and Better Generalization”)

    • Differential learning rates for encoder and head

    • 🌟Using CLS outputs from all BERT layers rather than only the last one, via a weighted sum of these outputs where the weights were learnable and constrained to be positive and sum to 1 (a rough sketch of this head, combined with multi-sample dropout, follows at the end of this section)

  • Their solution post also has a wise tip about the pitfalls of pseudo-labelling with data that is very similar to the train set, which can easily lead to overfitting
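
To make the two starred ideas above a bit more concrete, here is a minimal PyTorch sketch (my own rough interpretation, not the winners' code) of a head that combines multi-sample dropout with a learnable, softmax-weighted sum of the per-layer CLS outputs. The class name, argument names and hyperparameters are all placeholder assumptions:

    import torch
    import torch.nn as nn
    from transformers import BertModel

    class WeightedLayerClsHead(nn.Module):
        def __init__(self, model_name="bert-base-uncased", n_targets=30, n_dropout_samples=5, p=0.5):
            super().__init__()
            self.bert = BertModel.from_pretrained(model_name, output_hidden_states=True)
            n_layers = self.bert.config.num_hidden_layers + 1  # embedding layer + all encoder layers
            # Unconstrained parameters; the softmax below keeps the weights positive and summing to 1
            self.layer_weights = nn.Parameter(torch.zeros(n_layers))
            self.dropouts = nn.ModuleList([nn.Dropout(p) for _ in range(n_dropout_samples)])
            self.fc = nn.Linear(self.bert.config.hidden_size, n_targets)

        def forward(self, input_ids, attention_mask):
            out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
            # Stack the CLS token from every hidden state: (n_layers, batch, hidden)
            cls_per_layer = torch.stack([h[:, 0] for h in out.hidden_states], dim=0)
            w = torch.softmax(self.layer_weights, dim=0).view(-1, 1, 1)
            pooled = (w * cls_per_layer).sum(dim=0)  # (batch, hidden)
            # Multi-sample dropout: average the logits over several dropout masks
            return torch.stack([self.fc(d(pooled)) for d in self.dropouts], dim=0).mean(dim=0)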

2nd - Dual Transformer

  • Models:

    • One transformer for the question text, one for the answer text

    • 2 x dual Roberta-base

    • 🌟 dual Roberta-large (2 x 256 tokens; it turns out halving the number of tokens didn't lose much performance for roberta-large)

    • dual XLNet-base

    • 🌟 Siamese Roberta-large with weighted averaged layers (similar to 1st place solution)

    • See their post for really nice explanation and visualisation of the architectures

  • Found differential lrs + a warmup schedule were key to the dual-model arch; see their post for a description of their model heads (average -> concat -> 30 x 2 linear layers)

  • Same input text for all models, but the targets were changed to an ordinal representation (giving 170 target columns)

  • Training:

    • Differential learning rates (transformer @ 3e-5, model head(s) @ 5e-3)

    • 3 epochs, cosine scheduling, 1 epoch warmup

    • AdamW or RAdam with weight decay of 0.01

    • Effective batch size of 8 using 🌟gradient accumulation (see the training sketch after this list)

    • 🌟 Modified 5-fold GroupKFold strategy (see post for more)

  • Post-processing: threshold clipping

  • Equally weighted blend of the 5 model outputs
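
For reference, here is a rough sketch of how the training bullets above (differential learning rates, 1 epoch of cosine warmup, weight decay of 0.01 and gradient accumulation) could be wired together in PyTorch with the Hugging Face scheduler helper. model.transformer, model.head and train_loader are assumed names, and 2 accumulation steps of batch size 4 would give the effective batch size of 8:

    import torch
    from torch.optim import AdamW
    from transformers import get_cosine_schedule_with_warmup

    def train(model, train_loader, device, epochs=3, accumulation_steps=2):
        optimizer = AdamW([
            {"params": model.transformer.parameters(), "lr": 3e-5},  # encoder
            {"params": model.head.parameters(), "lr": 5e-3},         # regression head
        ], weight_decay=0.01)

        steps_per_epoch = len(train_loader) // accumulation_steps
        scheduler = get_cosine_schedule_with_warmup(
            optimizer,
            num_warmup_steps=steps_per_epoch,              # roughly 1 epoch of warmup
            num_training_steps=steps_per_epoch * epochs,
        )
        loss_fn = torch.nn.BCEWithLogitsLoss()

        model.train()
        for _ in range(epochs):
            optimizer.zero_grad()
            for step, (batch, targets) in enumerate(train_loader):
                logits = model(**{k: v.to(device) for k, v in batch.items()})
                loss = loss_fn(logits, targets.to(device)) / accumulation_steps
                loss.backward()
                # Gradient accumulation: only update every `accumulation_steps` mini-batches
                if (step + 1) % accumulation_steps == 0:
                    optimizer.step()
                    scheduler.step()
                    optimizer.zero_grad()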

 (In my notes I saved this link here to “Parameter-Efficient Transfer Learning for NLP”, so I’m leaving it here for now until I find a better home for it)

3rd - All the models

  • Used text columns plus a category column

  • 🌟 Text truncation: i) pre-truncate, ii) post-truncate, iii) head + tail tokens, iv) longer answer (a small sketch of the head + tail idea follows at the end of this section)

  • Models ensemble:

    • 🌟LSTM + Universal Sentence Encoder

    • 2 x BERT-base uncased, 2 x BERT-base cased, 2 x BERT-large uncased, 2 x BERT-large cased

    • ALBERT-base, Roberta-base, GPT2-base, XLNet-base

    • Stacking 2 Linear layers without activation performed better than single Linear layer in some cases.

    • Some of the models had a category embedding structure additionally.

    • See their post for really nice visualisations of their model architectures

  • Training:

    • Min-Max target scaling

    • Larger loss weights for the rarer positive or negative samples

    • 🌟 gelu_new activation for BERT-based models

    • Cosine warmup scheduler, 5-fold CV

    • EMA (exponential moving average of the model weights)

  • Post-processing: clipped predictions with thresholds decided by golden section search (see the solution post for a code example, and also here for more on golden-section search).

  • Blending: TPE (Tree-structured Parzen Estimator) optimization using Optuna

  • Didn't work: manifold mixup, pretraining masked LM, input-response prediction

  • CODE: LSTM code here, it’s a great kernel!
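
Since the head + tail truncation trick comes up again and again in these competitions, here is a tiny illustration of the idea (my own sketch with made-up argument names, not the 3rd place code):

    def head_tail_truncate(token_ids, max_len=512, n_head=128):
        """Keep the first n_head and the last (max_len - n_head) tokens of an over-long sequence."""
        if len(token_ids) <= max_len:
            return token_ids
        n_tail = max_len - n_head
        return token_ids[:n_head] + token_ids[-n_tail:]

    # e.g. with a Hugging Face tokenizer (before padding):
    # ids = tokenizer.encode(question_title + " " + question_body, add_special_tokens=True)
    # ids = head_tail_truncate(ids, max_len=512, n_head=128)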

4th - Triple Transformer

  • Solution post here, nice architecture overview post here and post-processing post here

  • Similar to the 2nd place solution but combining 3 transformers instead of 2

  • One different BERT + text input to predict each of:

    • 21 question targets

    • 9 answer targets

    • all 30 targets

  • Backpropagate through each transformer using 4 losses: loss1(answer_targets, output from bert_answer), loss2(question_targets, output from bert_question), loss3(all_targets, output from bert_3, which combines answer + question), and loss4, the final loss after concatenating all the logits from the different transformers (a rough sketch of this follows at the end of this section)

  • Do the same again for XLNet

  • Flexible token-length allocation when encoding 2 texts, e.g.:

    # If text1 doesn't use its full token budget, give the leftover budget to text2
    if len(text1) < max_len1 and len(text2) > max_len2:
        max_len2 = max_len2 + (max_len1 - len(text1))
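
And here is a rough sketch of the 4-loss idea described above (all names and shapes are my assumptions; the write-up backpropagates the losses separately, whereas I simply sum them here for brevity):

    import torch
    import torch.nn as nn

    loss_fn = nn.BCEWithLogitsLoss()

    def four_losses(bert_question, bert_answer, bert_both, final_head,
                    q_inputs, a_inputs, qa_inputs, q_targets, a_targets, all_targets):
        q_logits = bert_question(**q_inputs)    # (batch, 21) question targets
        a_logits = bert_answer(**a_inputs)      # (batch, 9) answer targets
        qa_logits = bert_both(**qa_inputs)      # (batch, 30) all targets

        loss1 = loss_fn(a_logits, a_targets)
        loss2 = loss_fn(q_logits, q_targets)
        loss3 = loss_fn(qa_logits, all_targets)
        # loss4: a final head on top of all the concatenated logits
        combined = final_head(torch.cat([q_logits, a_logits, qa_logits], dim=1))  # (batch, 30)
        loss4 = loss_fn(combined, all_targets)
        return loss1 + loss2 + loss3 + loss4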

5th - Dual models

  • Based on Internet Sentiment Analysis solution: BDCI2019-SENTIMENT-CLASSIFICATION (link in Mandarin)

  • Split into 2 models, 1 for the question targets and 1 for the answer targets

  • Roberta-large, Roberta-base, XLNet used

  • Added additional host, categorical and statistical features (e.g. word count, answer word count vs question word count)

  • Text cleaning to remove stop words and some symbols

  • Target output blend: 0.4 x Roberta-large + 0.3 x Roberta-base + 0.3 x XLNet-base (a minimal sketch of this blend is below)
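
The blend itself is just a weighted average of the three models' predictions, something like the following (the .npy file names are made up; each array would hold one model's predictions with shape (n_samples, 30)):

    import numpy as np

    preds_roberta_large = np.load("preds_roberta_large.npy")
    preds_roberta_base = np.load("preds_roberta_base.npy")
    preds_xlnet_base = np.load("preds_xlnet_base.npy")

    blend = (0.4 * preds_roberta_large
             + 0.3 * preds_roberta_base
             + 0.3 * preds_xlnet_base)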

That’s the Top 5 solutions from the Google QUEST competition; see here for the winners’ own summaries and here for a summary of solutions beyond the Top 5. Now to start the DeepFakes competition!!

As always, any feedback on this or my other posts is welcome, you can reach me on Twitter: @mcgenergy.
