The competition
In the Jigsaw Multilingual Toxic Comment Classification Kaggle, the goal is to classify comments as either toxic or non-toxic. For example, a comment like:
Uh, yeah it does, dumbass. Haven't you ever edited an article before to make it wordier and harder to understand?
would be toxic, and a comment like:
Some baklava for you!
Thanks for your copyedit of The Long, Hot Summer - much appreciated!
would be non-toxic.
The comments come in 7 different languages: English (en), Turkish (tr), Portuguese (pt), Russian (ru), French (fr), Italian (it), and Spanish (es). 3 main datasets are provided, denoted here by Train, Valid, and Test. Train contains comments in en, and each comment has been labelled 0 if it's non-toxic, and 1 if it's toxic. The comments in Valid are in tr, it, es, and they are unlabelled.
| | Train | Valid | Test |
|---|---|---|---|
| #(comment) | 223,549 | 8,000 | 63,812 |
| language(s) | en | tr, it, es | tr, pt, ru, fr, it, es |
Test contains comments in tr, pt, ru, fr, it, and es. For each comment in Test, the model should output a number between 0 and 1 that represents the probability of the comment being toxic. The predictions on Test are submitted to Kaggle for evaluation.
A main challenge of this Kaggle is that there are no common languages between Test, the target dataset, and Train, the provided training set.
For this competition, there is no private, withheld set for evaluation. The model's performance on Test determines the eventual leaderboard entirely, with the evaluation score on 30% of Test shown on the public leaderboard during the competition.
The solution
Each model used in this solution consists of a language model with the last few layers customised and a head that outputs a number between 0 and 1, representing the probability of the input text being a toxic comment.
The language models used include bi-directional GRUs from fastText and transformers from the transformers library. Amongst the transformers, some models are multilingual, so text in numerous languages can simply be input into them. Most models are monolingual.
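A rough sketch of such a model (not the authors' actual code) is shown below in PyTorch, using xlm-roberta-large as an example transformer backbone with a single-unit sigmoid head; the pooling and dropout choices are assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class ToxicClassifier(nn.Module):
    """A pretrained language model backbone plus a head that outputs P(toxic)."""

    def __init__(self, backbone_name="xlm-roberta-large"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        hidden_size = self.backbone.config.hidden_size
        # Head: first-token representation -> dropout -> single logit
        self.head = nn.Sequential(nn.Dropout(0.1), nn.Linear(hidden_size, 1))

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]                  # representation of the first token
        return torch.sigmoid(self.head(cls)).squeeze(-1)  # probability of toxicity in (0, 1)
```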
The main steps of the solution are as follows.
Train multilingual model on English examples
The multilingual model XLM-Roberta-large (XLM-R) is trained on Train, which has only en comments. Because of the very large number of pretrained parameters present in the model, and because binary classification is a relatively simple task, not much hyperparameter tuning is required. The training is carried out at a constant, small learning rate for a few epochs.
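A minimal sketch of this step, assuming the ToxicClassifier above and a train_examples list of (comment, label) pairs built from Train; the learning-rate value, batch size, sequence length, and epoch count shown are guesses, with only the constant small learning rate taken from the description.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = ToxicClassifier("xlm-roberta-large").to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # constant, small learning rate
loss_fn = torch.nn.BCELoss()

def collate(batch):
    texts, labels = zip(*batch)  # batch of (en comment, 0/1 label) pairs from Train
    enc = tokenizer(list(texts), padding=True, truncation=True,
                    max_length=192, return_tensors="pt")
    return enc["input_ids"], enc["attention_mask"], torch.tensor(labels, dtype=torch.float)

loader = DataLoader(train_examples, batch_size=16, shuffle=True, collate_fn=collate)

for epoch in range(2):  # "a few epochs"
    for input_ids, attention_mask, labels in loader:
        probs = model(input_ids.to(device), attention_mask.to(device))
        loss = loss_fn(probs, labels.to(device))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```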
Pseudo-label translations of English examples
A subset of 159,473 comments from Train is first selected (though it's not clear how they are selected), and each of these is translated into tr, pt, ru, fr, it, and es, using the Google Translate API. Then, the XLM-R just trained above is used to predict on each comment (in en), and the prediction is used as the pseudo-label for the comment's 6 translations. The translations of all the comments and their pseudo-labels make up what's referred to as the Combined Distilled dataset.
dataset name | Combined Distilled |
---|---|
comment source | Train |
#(unique comment) | 159,473 |
#(comment) | 6 x 159,473 |
language(s) of each unique comment | tr, pt, ru, fr, it, es |
label source | XLM-R on Train or some other baseline |
#(duplicated label) | 5 x 159,473 |
Essentially, translations in 6 languages have replaced the en comments, and the hard labels (0 and 1) have been replaced with pseudo-labels, the model's predictions on the en comments.
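A sketch of how Combined Distilled could be assembled; here translate() is a stand-in for a call to the Google Translate API, and predict_proba() is a stand-in for batched inference with the XLM-R trained above (neither is the authors' actual code).

```python
import pandas as pd

LANGS = ["tr", "pt", "ru", "fr", "it", "es"]

def build_combined_distilled(en_comments, translate, predict_proba):
    """en_comments: the selected subset of Train; returns one row per translation."""
    pseudo_labels = predict_proba(en_comments)  # XLM-R predictions on the original en text
    rows = []
    for text, y_hat in zip(en_comments, pseudo_labels):
        for lang in LANGS:
            rows.append({"comment_text": translate(text, target_lang=lang),
                         "lang": lang,
                         "label": y_hat})       # the same pseudo-label for all 6 translations
    return pd.DataFrame(rows)
```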
Pseudo-labelled Test
The same XLM-R is also used to predict on Test, with the output predictions used as pseudo-labels on Test itself. This dataset will be referred to as Pseudo-labelled Test.
dataset name | Pseudo-labelled Test |
---|---|
comment source | Test |
#(unique comment) | 63,812 |
#(comment) | 63,812 |
language(s) of each unique comment | tr, pt, ru, fr, it, es |
label source | XLM-R initially, then aggregated updates from monolingual and multilingual models |
#(duplicated label) | 0 |
Generating pseudo-labels for all comments in Test in one go is only possible because XLM-R is a multilingual model that handles all the languages in Test. And even though XLM-R has made predictions here for comments in tr, pt, ru, fr, it, and es, it hasn't yet been trained on any examples in those languages, only on the English comments in Train.
Unlike in Combined Distilled, there are no translations in Pseudo-labelled Test, so each comment's pseudo-label is unique, a number between 0 and 1.
Pseudo-labelled Test is updated through further training, but initially, this is how it's formed.
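Forming the initial Pseudo-labelled Test is then simply a matter of running the same model over Test; test_df and predict_proba() below are placeholders, not the authors' code.

```python
# test_df: one row per Test comment, with columns "comment_text" and "lang"
test_df["pseudo_label"] = predict_proba(test_df["comment_text"].tolist())
pseudo_labelled_test = test_df[["comment_text", "lang", "pseudo_label"]]
```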
Pseudo-label training with multiple language models
Once Combined Distilled and Pseudo-labelled Test have been obtained, they are concatenated into a larger dataset on which multiple rounds of training are carried out.
In each round of training, either a monolingual model or a multilingual model can be used. When a monolingual model is used, everything is restricted to that model's language. For example, if the model is a fr language model, only those examples whose comment is in fr are selected and used during training. At the end of every epoch, the model's predictions on Test are saved. After all epochs of training, these saved predictions are combined in a simple mean.
This mean is used to update the pseudo-labels of the current Pseudo-labelled Test. The update is carried out through an exponentially weighted average: $$ \hat{y} = \alpha \hat{y} + (1 - \alpha) \hat{y}_{run} \quad . $$ \(\hat{y}\) is the current pseudo-label of Pseudo-labelled Test. \(\hat{y}_{run}\) denotes the mean prediction from the current round of training mentioned just above. \(\alpha = 0.5\) is used in this solution. Note again that if the current training round uses a fr model, then only the pseudo-labels of the fr examples are updated, with the rest staying the same.
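One round's update could look like the sketch below, where epoch_preds holds the per-epoch predictions on Test saved during the round; all names are illustrative rather than taken from the solution's code.

```python
import numpy as np

def update_pseudo_labels(pseudo_labels, epoch_preds, test_langs, lang=None, alpha=0.5):
    """Exponentially weighted update of the Pseudo-labelled Test labels.

    pseudo_labels: current pseudo-labels, shape (n_test,)
    epoch_preds:   list of per-epoch prediction arrays on Test from this round
    test_langs:    language of each Test comment, shape (n_test,)
    lang:          restrict the update to this language (monolingual round), or None
    """
    y_run = np.mean(epoch_preds, axis=0)  # simple mean of the saved per-epoch predictions
    mask = np.ones(len(pseudo_labels), dtype=bool) if lang is None else (test_langs == lang)
    updated = pseudo_labels.copy()
    updated[mask] = alpha * pseudo_labels[mask] + (1 - alpha) * y_run[mask]
    return updated
```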
After this update, Pseudo-labelled Test can be submitted for evaluation by Kaggle.
Everything described so far in this section constitutes one training round. Successive rounds with different language models are carried out. For a complete list of monolingual transformers used in this solution, see this table. XLM-R is the only multilingual model used.
Note that while Combined Distilled is the same in all rounds, Pseudo-labelled Test changes. As it improves, the next training round will have better data to train on, and the submission will have a higher evaluation score.
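Putting the rounds together, the overall flow might look like the following schematic; train_and_predict is a hypothetical helper that trains one model on the (language-filtered) concatenated data and returns its per-epoch predictions on Test, and the model names are placeholders, not the solution's actual model list.

```python
# Placeholder schedule of (model checkpoint, language restriction); None means a multilingual round
rounds = [
    ("some-fr-transformer", "fr"),
    ("some-tr-transformer", "tr"),
    ("xlm-roberta-large", None),
]

pseudo_labels = initial_pseudo_labels  # the XLM-R predictions on Test from above
for model_name, lang in rounds:
    # Train on Combined Distilled + Pseudo-labelled Test, restricted to `lang` if monolingual,
    # saving predictions on Test at the end of every epoch.
    epoch_preds = train_and_predict(model_name, combined_distilled, pseudo_labels, lang)
    pseudo_labels = update_pseudo_labels(pseudo_labels, epoch_preds, test_langs, lang=lang)
    # The updated pseudo-labels can be submitted for evaluation after each round.
```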
Post-processing submissions
In addition to improving the evaluation score by iteratively improving the pseudo-labels on Test, it's possible to combine several existing submissions into one that ought to gain an even higher evaluation score. This is illustrated here for a simple scenario with 4 existing submissions, whose evaluation scores are 0.9934, 0.9935, 0.9936, and 0.9940.
In the table below, these 4 submissions are arranged in ascending order of evaluation score, and the first 2 pseudo-labels, \(\hat{y}_0\) and \(\hat{y}_1\), are shown for each submission.
| | submission 3 | submission 2 | submission 1 | submission 4 |
|---|---|---|---|---|
| evaluation score | 0.9934 | 0.9935 | 0.9936 | 0.9940 |
| \(\hat{y}_0\) | 0.3 | 0.5 | 0.6 | 0.8 |
| \(\hat{y}_1\) | 0.8 | 0.6 | 0.6 | 0.4 |
The first thing to do is roughly gauge how the pseudo-labels change with increasing score, by calculating the average difference between adjacent submissions, \(\Delta\) (pairing the first submission with the second and the third with the fourth). For the first pseudo-label, \(\hat{y}_0\), this is: $$ \Delta = \frac{(0.5 - 0.3) + (0.8 - 0.6)}{2} = 0.2 $$ and for the second pseudo-label, it's: $$ \Delta = \frac{(0.6 - 0.8) + (0.4 - 0.6)}{2} = - 0.2 $$ \(\hat{y}_0\) has an upward trend of 0.2, whilst \(\hat{y}_1\) has a downward trend of 0.2.
This suggests that the score is likely to increase further if \(\hat{y}_{0}\) is increased further and \(\hat{y}_1\) is decreased further. The exact adjustment is determined by the following expression, where \(\hat{y}_{best}\) is the pseudo-label from the highest-scoring submission (submission 4 here): $$ \hat{y}_{new} = \left\{ \begin{array}{rl} (1 - \alpha \Delta) \hat{y}_{best} + \alpha\Delta & \text{if } \Delta \ge 0 \\ (1 + \alpha \Delta) \hat{y}_{best} & \text{otherwise} \quad . \end{array} \right. $$ Notice that it keeps the adjusted pseudo-label within the range between 0 and 1, as it needs to be. Applying this expression to the example here, the new value for the first pseudo-label should be: $$ \hat{y}_0 = (1 - 0.2\alpha) \cdot 0.8 + 0.2\alpha = 0.84 $$ and the new value for the second pseudo-label should be: $$ \hat{y}_1 = (1 - 0.2\alpha) \cdot 0.4 = 0.32 $$ As expected, \(\hat{y}_0\) has increased, and \(\hat{y}_1\) has decreased. \(\alpha = 1\) is used here, but in practice, any value between 1 and 2 can be used.
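The whole procedure fits in a few lines of NumPy. The sketch below pairs adjacent submissions as in the worked example (first with second, third with fourth), which is one reading of the description rather than the authors' exact code.

```python
import numpy as np

def postprocess(submissions, alpha=1.0):
    """Nudge the best submission further along the per-comment trend.

    submissions: array of shape (n_submissions, n_test), rows sorted by ascending score.
    """
    n = (len(submissions) // 2) * 2                    # use an even number of rows for pairing
    deltas = (submissions[1:n:2] - submissions[0:n:2]).mean(axis=0)
    best = submissions[-1]                             # highest-scoring submission
    up = (1 - alpha * deltas) * best + alpha * deltas  # upward trend: push towards 1
    down = (1 + alpha * deltas) * best                 # downward trend: shrink towards 0
    return np.where(deltas >= 0, up, down)

# On the example above: postprocess(np.array([[0.3, 0.8], [0.5, 0.6], [0.6, 0.6], [0.8, 0.4]]))
# returns approximately [0.84, 0.32].
```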
Applying this post-processing technique to submissions has proven to help increase the overall evaluation score on the leaderboard.
Team Lingua Franca are rafiko1 + leecming.