Introduction

Label smoothing was introduced by Szegedy et al. in the paper Rethinking the Inception Architecture for Computer Vision. Since then, this trick has been used in many papers to improve the accuracy of various architectures. Although widely used, there was little insight into why this technique helps the model perform better. The paper When does Label Smoothing Help? by Rafael Müller et al. answers the questions "What does label smoothing do?" and "Why does it help the model?". This blog post is an attempt to explain the main result of the paper.

1. What Is Label Smoothing?

Generally, in a classification problem, our aim is to maximize the log-likelihood of our ground-truth label. In other words, we want our model to assign maximum probability to the true label given the parameters and the input, i.e. we want ${P(y\mid x,\theta)}$ to be high, where ${y}$ is known beforehand. We motivate our model to achieve this by minimizing the cross-entropy loss between the predictions our model outputs and the ground-truth labels. Cross-entropy loss is defined by the equation ${L(y,\hat y)=-\sum_{i=1}^{n} y_{i} \log(\hat y_{i}) }$ where ${n}$ is the number of classes, ${y_{i}}$ is 1 if the image belongs to class ${i}$ and 0 otherwise, and ${\hat y_{i}}$ is the predicted probability that the image belongs to class ${i}$. Don't be intimidated by the equation and jargon, because in reality the calculation of the loss is very easy. Suppose you build a model for the task of image classification where an image can belong to one of 3 classes. For every input image, the model outputs a 3-dimensional vector. Let's say for a particular image the model's normalised output is ${\hat y = [0.2, 0.7,0.1]}$ and the image belongs to category 2. Therefore, the target vector for that image will be ${y = [0,1,0]}$. The loss for this image will be ${-(0\times \log 0.2 + 1\times \log 0.7 + 0\times \log 0.1) = -\log 0.7\approx0.36}$, which is low because our model assigns high probability to the ground-truth label. If instead our predictions are ${ \hat y=[0.8,0.1,0.1]}$, then the loss will be ${-\log 0.1\approx2.3}$, which is high because now our model assigns low probability to the ground-truth label.
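
To make this concrete, here is a tiny sketch in plain Python (using the natural logarithm, as in the calculations above) that reproduces the two loss values. The numbers are just the made-up predictions from this example.

import math

def cross_entropy(y, y_hat):
    '''Cross-entropy between a target vector y and predictions y_hat.'''
    return -sum(t * math.log(p) for t, p in zip(y, y_hat))

y = [0, 1, 0]                                  # ground truth: class 2
print(cross_entropy(y, [0.2, 0.7, 0.1]))       # ~0.36, high probability on the true label
print(cross_entropy(y, [0.8, 0.1, 0.1]))       # ~2.30, low probability on the true label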

There is a little more to how the normalised predictions of the model are calculated. The model's predictions are calculated by applying the softmax activation to the last layer's output. The model outputs an n-dimensional vector (one entry per class) and each element of this vector is called a 'logit'. For the model's outputs to represent a valid probability distribution over the classes, they should be non-negative and sum to 1. This is accomplished by passing the logits through a softmax layer. Let's say the output vector for a certain input image is ${z = [z_{1}, z_{2},...,z_{n}]}$; then the predictions are calculated as ${\hat y = \text{Softmax}\left(z \right) = [\frac {e^{z_{1}}}{\sum_{i=1}^{n} e^{z_{i}}}, \frac {e^{z_{2}}}{\sum_{i=1}^{n} e^{z_{i}}},..., \frac {e^{z_{n}}}{\sum_{i=1}^{n} e^{z_{i}}}]}$ ${(Eq\,1.1)}$. Notice that the sum of all the elements of ${\hat y}$ is 1. Suppose the ground-truth label for the image is 2; then the target vector is ${[0,1,0,0,....0]}$ (the target vector has length n as well). Thus, the cross-entropy loss for this image, in its full glory, is written as ${\text loss\left(y,z\right) = -1 \times \log \frac {e^{z_{2}}}{\sum_{i=1}^{n} e^{z_{i}}} = \log {\sum_{i=1}^{n} e^{z_{i}}} - z_{2}}$. Minimising this loss encourages ${z_{2}}$ to be as high as possible while the ${z_{i}}$ for ${i\ne2}$ are encouraged to be low. Szegedy et al. highlight two problems with this approach:

  1. The first problem with this approach is that the model becomes over-confident in its predictions, as it learns to assign nearly 100% probability to the ground-truth label. Szegedy et al. argue that this can lead to overfitting and the model may not generalize well. Intuitively this makes sense. For example, let's say our dataset contains two semantically similar classes, class1 and class2 (the pets dataset has plenty of those). Unfortunately our dataset contains many instances of class1 but relatively few instances of class2. Suppose image1 belongs to class1 and image2 to class2. Because these images are very similar, their output logits would be very similar. Our over-confident model may assign class1 to image2 with high confidence (close to 100% probability), and this can incur a heavy validation loss.

  2. The other problem with this approach is vanishing gradients. The gradient of our loss w.r.t. the logit of the correct class label ${k}$ is ${\frac {e^{z_{k}}}{\sum_{i=1}^{n} e^{z_{i}}}-1}$, and w.r.t. the other logits it is ${\frac {e^{z_{i}}}{\sum_{i=1}^{n} e^{z_{i}}}}$. Minimising the cross-entropy loss drives the logit of the correct class to be much higher than the other logits, so these gradients all shrink towards zero, and this hinders the model's ability to adapt. (A quick autograd check below verifies these gradient formulas.)
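
Here is that check, a small PyTorch sketch with made-up logits and class 2 as the true label; autograd's gradient matches ${\text{Softmax}(z) - y}$, i.e. ${\text{Softmax}(z)_{k}-1}$ for the correct logit and ${\text{Softmax}(z)_{i}}$ for the others:

import torch
import torch.nn.functional as F

z = torch.tensor([1.0, 4.0, -0.5], requires_grad=True)   # made-up logits
target = torch.tensor(1)                                  # true class (class 2, 0-indexed)

# cross-entropy written in terms of logits: log-sum-exp(z) minus the correct logit
loss = torch.logsumexp(z, dim=0) - z[target]
loss.backward()

print(z.grad)                                               # gradient from autograd
print(F.softmax(z.detach(), dim=0) - F.one_hot(target, 3))  # softmax(z) - one-hot, same values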

What can we do to counteract these two problems? Szegedy et al. suggest that we shouldn't provide sparse one-hot encoded vectors as targets. Instead we should "smooth" them. This is done by replacing the probability distribution over labels, which is a Dirac delta distribution, with a linear combination of the Dirac delta distribution and a uniform distribution. This may sound incredibly complex but is in reality very easy to implement. Let's define what the above jargon means.

The Dirac delta function, denoted by ${\delta _{i,l}}$, is a function which is 1 for ${i=l}$ and 0 everywhere else. If an image has class ${l=3}$ as its label and there are ${k=4}$ classes in total, then the target vector for that image has the probability distribution ${\delta _{i,3}}$ for ${i=1\,to\,k}$. Here ${i}$ represents the index of the target vector (I haven't used 0-indexing), and therefore the target vector is ${[0,0,1,0]}$. Notice that ${\delta _{i,l}}$ is a valid probability distribution as it sums to 1 over its domain. A uniform distribution is a distribution which has a constant value over its domain. Let's say our domain consists of ${\{x\in[1,4]: x\in \mathbb{Z}\}}$, read as "x between 1 and 4, both included, such that x is an integer". So ${x\in \{1,2,3,4\}}$. The uniform distribution over this domain is denoted ${U\left(x\right)}$. ${\therefore U\left(1\right) = U\left(2\right) = U\left(3\right) = U\left(4\right) = c}$. The sum over the domain, i.e. ${\sum_{i=1}^{4} U(i)}$, is ${4c}$. For ${U(x)}$ to be a valid probability distribution, ${4c}$ should equal 1, ${\therefore c=0.25}$. More generally, if there are ${k}$ points in our domain, then the uniform distribution over the domain is ${U(i)=\frac{1}{k}}$ where ${i}$ is any point in the domain.

Let's denote the distribution over labels for a particular image as ${q\left(i\right)}$ for ${i=1\,to\,k}$, where ${k}$ denotes the total number of classes and ${l}$ denotes the true label of the image. Normally, ${q\left(i\right) = \delta _{i,l}}$. Szegedy et al. propose to replace ${\delta _{i,l}}$ with ${(1-\varepsilon)\delta _{i,l} + \varepsilon U\left(i\right)\,for\,i=1\,to\,k}$, where ${\varepsilon}$ is a hyperparameter. As explained above, ${U\left(i\right)=\frac {1}{k}\,for\,i=1\,to\,k}$. Then our new distribution over labels is ${q'(i) = (1-\varepsilon)\delta _{i,l} + \frac{\varepsilon}{k}}$ $(Eq\,1.2)$. Let's see how to do this using an example.

Suppose the distribution over the target labels of an image, say image1, for a classification task with ${k=4}$ classes is ${q\left(i,2\right)=\delta_{i,2}\, for \,i = 1\,to\,k}$. ${i}$ here represents the index of the target vector. Thus, the target vector will be ${y^{h}=[0,1,0,0]}$. Then our new distribution over labels according to ${Eq\,1.2}$ is ${q'\left(i,2\right) = (1-\varepsilon)\delta _{i,2} + \frac{\varepsilon}{4}}$ for ${i=1\,to\,4}$. Subsequently, the smoothed target vector ${y^{l}}$ will be ${[\frac{\varepsilon}{4},(1-\varepsilon)+\frac{\varepsilon}{4},\frac{\varepsilon}{4},\frac{\varepsilon}{4}]}$ = ${[0.25\varepsilon, 1-\varepsilon+0.25\varepsilon, 0.25\varepsilon,0.25\varepsilon]}$. If ${\varepsilon = 0.2}$, then ${y^{l} =[0.05,0.85,0.05,0.05]}$. Notice that the elements of the new smoothed label vector still sum to 1, which confirms that ${(1-\varepsilon)\delta _{i,l}+\varepsilon U\left(i\right)}$ is a valid probability distribution over the labels.
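
A minimal sketch of this smoothing step (the numbers mirror the example above: ${k=4}$ classes, ${\varepsilon=0.2}$):

import torch

def smooth_labels(y_onehot, eps):
    '''Apply Eq 1.2: (1 - eps) * one-hot + eps * uniform.'''
    k = y_onehot.shape[-1]
    return (1 - eps) * y_onehot + eps / k

y_h = torch.tensor([0., 1., 0., 0.])      # one-hot target for class 2
print(smooth_labels(y_h, eps=0.2))        # tensor([0.0500, 0.8500, 0.0500, 0.0500])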

Let's see what difference it makes to change the labels in the way shown above. Suppose our model is extremely confident and outputs the prediction vector ${p_{1}=[0.01,0.97,0.01,0.01]}$ for image1. So the model is almost certain that this image has label 2, which sounds like a good thing since this image really does have label 2. The loss with the smoothed labels ${y^{l}}$ will be ${L(y^{l},p_{1})= -(0.05\log 0.01+0.85\log0.97+0.05\log0.01+0.05\log0.01)\approx 0.72}$. Now suppose our model instead outputs ${p_{2}=[0.05,0.85,0.05,0.05]}$, i.e. it is still confident that the image has label 2, but not absurdly so. The loss will be ${L(y^{l},p_{2})= -(0.05\log0.05+0.85\log0.85+0.05\log0.05+0.05\log0.05)\approx 0.59}$, which is less than the loss with ${p_{1}}$! This goes to show that smooth labels want the model to be confident about its predictions, but not over-confident.
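
The same calculation takes only a couple of lines of PyTorch, if you want to check these numbers yourself:

import torch

y_l = torch.tensor([0.05, 0.85, 0.05, 0.05])   # smoothed target
p_1 = torch.tensor([0.01, 0.97, 0.01, 0.01])   # over-confident prediction
p_2 = torch.tensor([0.05, 0.85, 0.05, 0.05])   # confident, but not over-confident

print(-(y_l * p_1.log()).sum())   # ~0.72
print(-(y_l * p_2.log()).sum())   # ~0.59, the lower of the two losses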

Intuitively, we can think of label smoothing as a process that reduces the model's confidence in the ground-truth labels. The ground-truth labels may sometimes be wrong owing to errors in the data labelling or data collection process. Label smoothing can make the model robust against those incorrect labels.

2. Implementation In Code

To implement label smoothing we don't change every label individually; instead we define a new loss function. The loss function is still the cross-entropy loss, but the target vector for every image changes. Our new target vector for a particular image is the ${k}$-dimensional vector ${y^{l} = [\frac {\varepsilon}{k},\frac {\varepsilon}{k},\ldots,(1 - \varepsilon) + \frac{\varepsilon}{k},\ldots,\frac {\varepsilon}{k}]}$, with ${(1 - \varepsilon) + \frac{\varepsilon}{k}}$ at the position of the true class. Let's assume the image belongs to class ${j}$. The normal one-hot encoded target vector will have a 1 at position ${j}$ and 0 everywhere else. Let's denote it as ${y^{h}}$, so ${y^{h} = [0,0,0,\ldots,1,0,\ldots,0]}$.

The loss with ${y^{h}}$ is ${L(y^{h},\hat y)= -\log \hat y_{j}}$. ${Eq\,2.1}$

The loss with the new smoothed labels is ${L\left(y^{l},\hat y\right) = \sum_{i=1}^{k} -y_{i}^{l}\log \hat y_{i}}$ = ${- \left( \frac {\varepsilon}{k}\log\hat y_{1} +...+ \left(1-\varepsilon+ \frac{\varepsilon}{k}\right)\log\hat y_{j}+\frac {\varepsilon}{k}\log\hat y_{j+1}+...+\frac {\varepsilon}{k}\log\hat y_{k}\right)}$. We can rewrite this as ${L\left(y^{l},\hat y\right) = -\left(1-\varepsilon\right)\times\log\hat y_{j} - \frac{\varepsilon}{k}\times\left(\sum_{i=1}^{k} \log\hat y_{i}\right)}$. An eagle-eyed reader will notice that the term multiplied by ${\left(1 - \varepsilon\right)}$, i.e. ${-\log\hat y_{j}}$, is the cross-entropy loss calculated with the one-hot encoded target vector. Therefore, ${L\left(y^{l},\hat y\right) = \left(1-\varepsilon\right)L\left(y^{h},\hat y\right)-\frac{\varepsilon}{k}\left(\sum_{i=1}^{k} \log\hat y_{i}\right)}$. ${(Eq\,2.2)}$
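
Eq 2.2 is easy to verify numerically. The sketch below uses a made-up prediction vector with ${k=4}$, ${\varepsilon=0.1}$ and true class ${j=2}$, and compares the direct smoothed cross-entropy with the decomposed form:

import torch

eps, k, j = 0.1, 4, 1                            # j = 1 is class 2 (0-indexed)
y_hat = torch.tensor([0.1, 0.7, 0.15, 0.05])     # made-up normalised predictions

# direct cross-entropy with the smoothed target vector
y_l = torch.full((k,), eps / k)
y_l[j] += 1 - eps
direct = -(y_l * y_hat.log()).sum()

# Eq 2.2: (1 - eps) * one-hot loss  -  (eps / k) * sum of log predictions
decomposed = (1 - eps) * (-y_hat[j].log()) - (eps / k) * y_hat.log().sum()

print(direct, decomposed)                        # the two values agree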

So we only need to modify the loss function of our model and we are good to go. The implementation is shown below. The code snippet uses the PyTorch framework and is adapted from the fast.ai course.

#collapse-show
from torch import nn
import torch.nn.functional as F

def lin_comb(a1, a2, factor):
    '''This function calculates the linear combination of
    two quantities a1 and a2, where the respective
    coefficients are factor and (1-factor).'''
    return factor*a1 + (1-factor)*a2

def reduce_loss(loss, reduction='mean'):
    '''We need this function because we generally calculate
    losses for a batch of images and take the mean or sum of all the
    losses. But throughout this blog we input only a single image
    into the model, so you can ignore this function and just assume
    that it does nothing, e.g. reduce_loss(2) = 2.'''
    return loss.mean() if reduction=='mean' else loss.sum() if reduction=='sum' else loss    

class LabelSmoothing(nn.Module):
    def __init__(self, f:float=0.1, reduction = 'mean'):
        super().__init__()
        self.f = f #factor for linear combination
        self.reduction = reduction #You can safely ignore this
    
    def forward(self,pred,targ):
        #this computes the log of the predictions, i.e. the log of the softmax in Eq 1.1
        ls = F.log_softmax(pred, dim = 1) 
        #this line of code calculates the sum part of second term in Eq 2.2
        l1 = reduce_loss(ls.sum(1), self.reduction)
        #this line of code calculates Eq 2.1
        l2 = F.nll_loss(ls, targ,reduction= self.reduction)
        #finally this line implements Eq 2.2
        return lin_comb(-l1/pred.shape[-1],l2,self.f) 
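
A quick usage sketch with made-up logits and labels is given below (note that pred should be raw logits, not softmax outputs, since F.log_softmax is applied inside forward). As a cross-check, recent PyTorch versions (1.10+, if I recall correctly) ship a built-in label_smoothing argument on nn.CrossEntropyLoss, which should produce the same value:

import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(8, 10)             # a batch of 8 images, 10 classes (made-up)
targets = torch.randint(0, 10, (8,))    # made-up ground-truth labels

loss_fn = LabelSmoothing(f=0.1)
print(loss_fn(logits, targets))

# the built-in version, for comparison (should match the value above)
print(nn.CrossEntropyLoss(label_smoothing=0.1)(logits, targets))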

3. How And Why Does It Work?

Label smoothing goes against the conventional practice of maximising the likelihood of the ground-truth label. Instead, it penalises the model if the predicted probabilities of the incorrect classes get too close to 0. This can be seen from the second term of the loss derived above, i.e. ${-\frac{\varepsilon}{k}\left(\sum_{i=1}^{k} \log\hat y_{i}\right)}$: if any of the ${\hat y_{i}\, for\, i = {1,2,...,k}}$ get too close to 0 then the loss goes up (${\log}$ of something close to 0 is a large negative number). In contrast, maximising the likelihood of the one-hot encoded ground-truth label encourages the logits that don't correspond to the correct label to go as low as possible. With smooth labels ${y^{l}}$ our aim is to maximise ${P(y^{l}\mid x,\theta)}$. Let's see why maximising the likelihood of smooth labels instead of one-hot encoded labels is beneficial for our model.

Calculating Loss Without Label Smoothing

Let's imagine that we have the task of building a model for image classification where each image can have one of three labels. This means our model will output a 3-dimensional vector containing our three logits. Assume that the penultimate layer of the model has 4 activations. We feed an image into this model which has the target vector ${y^{h} = [0,1,0]^{T}}$. The penultimate layer's activations are ${X = [x_{1},x_{2},x_{3},x_{4}]^{T}}$ and the last layer's outputs are ${Z = [z_{1},z_{2},z_{3}]^{T}}$ (a single vector is conventionally written as a column vector, therefore ${X}$, ${Z}$ and ${y^{h}}$ are written as transposes of row vectors). ${Z}$ is calculated from the penultimate layer's activations using the equation ${Z = W\star X}$ (${\star}$ here denotes matrix multiplication). Bias is ignored for the sake of brevity. ${W}$ is the weight matrix connecting the penultimate layer and the output layer: ${W = \left[ \begin{array}{cccc} w_{11} & w_{12} & w_{13} & w_{14} \\ w_{21} & w_{22} & w_{23} & w_{24}\\ w_{31} & w_{32} & w_{33} & w_{34} \end{array} \right]}$. In short, the weight matrix can be written as ${W = [w_{1},w_{2},w_{3}]^{T}}$ where ${w_{i} = [w_{i1},w_{i2},w_{i3},w_{i4}]}$. The output vector ${Z}$ is calculated as ${W\star X = \left[ \begin{array}{c} w_{11} x_{1} + w_{12} x_{2} + w_{13} x_{3} + w_{14} x_{4} \\ w_{21} x_{1} + w_{22} x_{2} + w_{23} x_{3} + w_{24} x_{4}\\ w_{31} x_{1} + w_{32} x_{2} + w_{33} x_{3} + w_{34} x_{4} \end{array} \right]}$. In short this can be written as ${Z = \left[\begin{array}{c} z_{1} \\ z_{2} \\ z_{3} \end{array} \right] = \left[ \begin{array}{c} w_{1}X^{T} \\ w_{2}X^{T} \\ w_{3}X^{T} \end{array} \right]}$ where ${w_{i}X^{T}}$ denotes the inner product between ${w_{i}}$ and ${X}$. ${Z}$ is a vector of logits and is un-normalised. To get our prediction vector we have to normalise it by passing ${Z}$ through a softmax layer. Our prediction vector is ${\hat y = \left[ \begin{array}{c} \frac {e^{w_{1}X^{T}}}{e^{w_{1}X^{T}}+e^{w_{2}X^{T}}+e^{w_{3}X^{T}}} \\ \frac {e^{w_{2}X^{T}}}{e^{w_{1}X^{T}}+e^{w_{2}X^{T}}+e^{w_{3}X^{T}}} \\ \frac {e^{w_{3}X^{T}}}{e^{w_{1}X^{T}}+e^{w_{2}X^{T}}+e^{w_{3}X^{T}}} \end{array} \right]}$. As given before, our target vector is ${y^{h} = [0,1,0]^{T}}$, so our cross-entropy loss will be ${L\left(y^{h},\hat y\right) = -\log \left(\frac {e^{w_{2}X^{T}}}{e^{w_{1}X^{T}}+e^{w_{2}X^{T}}+e^{w_{3}X^{T}}}\right)}$. For preserving our sanity, let's denote ${e^{w_{1}X^{T}}+e^{w_{2}X^{T}}+e^{w_{3}X^{T}}}$ by ${S}$. Then ${L\left(y^{h},\hat y\right) = -\log \left(\frac {e^{w_{2}X^{T}}}{S}\right) = \log {S}-{w_{2}X^{T}}}$.
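
A sketch of this forward pass with made-up numbers (4 penultimate activations, 3 classes, true class 2), checking that the cross-entropy with ${y^{h}}$ indeed equals ${\log S - w_{2}X^{T}}$:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
W = torch.randn(3, 4)    # weight matrix connecting penultimate and output layer (made-up)
X = torch.randn(4)       # penultimate layer's activations (made-up)

Z = W @ X                               # logits [w1.X, w2.X, w3.X]
y_hat = F.softmax(Z, dim=0)             # prediction vector
L_h = -y_hat[1].log()                   # cross-entropy with y_h = [0, 1, 0]
print(L_h, torch.logsumexp(Z, dim=0) - Z[1])   # log S - w2.X gives the same value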

Calculating Loss With Label Smoothing

Our prediction vector is the same as before, but our target vector changes. Let's denote our label-smoothed target vector as ${y^{l}}$. So ${y^{l} = [\frac {\varepsilon}{3}, 1-\varepsilon + \frac {\varepsilon}{3}, \frac {\varepsilon}{3}]^{T}}$. Then our new loss will be ${L\left(y^{l},\hat y\right) = -\frac {\varepsilon}{3}\times \log\frac {e^{w_{1}X^{T}}}{S}-\left(1-\varepsilon+\frac{\varepsilon}{3}\right)\times \log\frac {e^{w_{2}X^{T}}}{S}-\frac {\varepsilon}{3}\times \log\frac {e^{w_{3}X^{T}}}{S}}$. Grouping the variables appropriately,

${L\left(y^{l},\hat y\right)= -\left(1-\varepsilon\right)\times \log\frac {e^{w_{2}X^{T}}}{S}-\frac {\varepsilon}{3}\times\left(\log\frac {e^{w_{1}X^{T}}}{S}+\log\frac {e^{w_{2}X^{T}}}{S}+\log\frac {e^{w_{3}X^{T}}}{S}\right)}$. Remember that ${\log a + \log b = \log ab}$. Using this rule, the loss can be written as ${L\left(y^{l},\hat y\right)=\left(1-\varepsilon\right)\left(\log S-w_{2}X^{T}\right)-\frac{\varepsilon}{3}\times{\log\left(\frac{e^{w_{1}X^{T}+w_{2}X^{T}+w_{3}X^{T}}}{S^{3}}\right)}}$. To further reduce this equation, we need to know two more rules:

  1. ${\log\frac{a}{b}=\log a-\log b}$ and
  2. ${\log a^{b}=b\log a}$.

Then, ${L\left(y^{l},\hat y\right)=\left(\log S-w_{2}X^{T}\right)-\varepsilon\left(\log S - w_{2}X^{T}\right)-\frac{\varepsilon}{3}\left(w_{1}X^{T}+w_{2}X^{T}+w_{3}X^{T}\right)+\frac{\varepsilon}{3}\log\left(S^{3}\right)}$.

Expanding the second term in this expression and applying rule 2 to the last term, we get ${L\left(y^{l},\hat y\right)=\left(\log S-w_{2}X^{T}\right)-\varepsilon\log S+\varepsilon\left(w_{2}X^{T}\right)-\frac{\varepsilon}{3}\left(w_{1}X^{T}+w_{2}X^{T}+w_{3}X^{T}\right)+{\varepsilon}\log S}$. Notice that the first term of this expression is our ${L\left(y^{h},\hat y\right)}$, the ${-\varepsilon\log S}$ and ${+\varepsilon\log S}$ terms cancel, and the remaining terms combine into ${\frac{\varepsilon}{3}\left(2w_{2}X^{T}-w_{1}X^{T}-w_{3}X^{T}\right)}$. Therefore, our loss with smooth labels can finally be written as ${L\left(y^{l},\hat y\right)=L\left(y^{h},\hat y\right)+\frac{\varepsilon}{3}\left(2w_{2}X^{T}-w_{1}X^{T}-w_{3}X^{T}\right)}$.
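
This identity is easy to verify numerically. The sketch below uses a random (made-up) weight matrix and penultimate activations and compares the two sides of the equation:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
eps = 0.1
W = torch.randn(3, 4)            # templates w1, w2, w3 (made-up)
X = torch.randn(4)               # penultimate layer's activations (made-up)
z = W @ X                        # logits [w1.X, w2.X, w3.X]

log_p = F.log_softmax(z, dim=0)
L_h = -log_p[1]                                            # one-hot loss, true class 2
y_l = torch.tensor([eps / 3, 1 - eps + eps / 3, eps / 3])  # smoothed target
L_l = -(y_l * log_p).sum()                                 # loss with smoothed labels

rhs = L_h + (eps / 3) * (2 * z[1] - z[0] - z[2])           # right-hand side of the identity
print(L_l, rhs)                                            # the two values agree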

4. Geometric Point Of View

(X -> Penultimate layer's activation)

Our last layer's output for the image we input earlier is ${Z= \left[ \begin{array}{c} w_{1}X^{T} \\ w_{2}X^{T} \\ w_{3}X^{T} \end{array} \right]}$. Since this image belongs to class 2, minimising either of the loss functions calculated above increases ${w_{2}X^{T}}$ while ${w_{1}X^{T}}$ and ${w_{3}X^{T}}$ are decreased. More generally, if an image belongs to class ${k}$ then minimising the loss increases ${z_{k}=w_{k}X^{T}}$ while every other logit is decreased. Also, notice the pattern that ${w_{i}}$ produces the logit for class ${i}$ via the operation ${w_{i}X^{T}}$. Hence ${w_{i}}$ can be thought of as a template for class ${i}$, and from now on I'll sometimes refer to ${w_{i}}$ as the template for class ${i}$. Let's try to view the process of minimising or maximising ${w_{i}X^{T}}$ geometrically.

Euclidean Norm

The Euclidean distance between two vectors is simply the Euclidean norm of their difference. For two vectors ${a}$ and ${b}$ it can be calculated as ${\lVert a-b\rVert=\left(a^{T}\star a-2a^{T}\star{b}+b^{T}\star b\right)^{\frac{1}{2}}}$, ${\therefore \lVert a-b\rVert^{2}= a^{T}\star a-2a^{T}\star{b}+b^{T}\star b}$. (Remember that ${\star}$ denotes matrix multiplication.)

Loss Minimisation as Distance Minimisation/Maximisation

Now that we know how to calculate the Euclidean distance, let's calculate it for ${w_{i}}$ and ${X}$: ${\lVert w_{i}-X\rVert^{2}= w_{i}^{T}\star w_{i}-2w_{i}^{T}\star{X}+X^{T}\star X= w_{i}^{T}\star w_{i}-2w_{i}{X}^{T}+X^{T}\star X}$ (here ${w_{i}^{T}\star X}$ is just the inner product, which in our earlier notation is ${w_{i}X^{T}}$). Geometrically, this quantity is the square of the distance between the template for class ${i}$ and the penultimate layer's activations ${X}$.

Notice the second term in the expression for ${\lVert w_{i}-X\rVert^{2}}$, which is ${-2w_{i}X^{T}}$. If ${w_{i}X^{T}}$ increases, the distance between ${w_{i}}$ and ${X}$ decreases, and whenever it decreases the distance increases. But ${w_{i}X^{T}}$ is just the logit ${z_{i}}$. This means whenever ${z_{i}}$ increases/decreases, the distance between ${w_{i}}$, i.e. the template for class ${i}$, and ${X}$, i.e. the penultimate layer's output vector, decreases/increases. If an image belongs to class ${k}$, minimising the loss increases ${z_{k}}$ and decreases every other logit. This means that minimising the loss is the same as minimising the distance between the penultimate layer's output ${X}$ and the template of the correct class ${w_{k}}$, and maximising the distance between ${X}$ and the template of every incorrect class, i.e. ${w_{i}}$ where ${i \neq k}$.
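
The sketch below checks this relationship with made-up vectors: the expansion of the squared distance contains ${-2w_{i}X^{T}}$, so for fixed norms a larger logit means a smaller distance.

import torch

torch.manual_seed(0)
w_i = torch.randn(4)     # template for class i (made-up)
X = torch.randn(4)       # penultimate layer's activations (made-up)

squared_distance = ((w_i - X) ** 2).sum()
expansion = (w_i * w_i).sum() - 2 * (w_i * X).sum() + (X * X).sum()
print(squared_distance, expansion)   # identical: the -2 * w_i.X term carries the logit
print((w_i * X).sum())               # and this inner product is exactly the logit z_i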

Thus, we can infer that minimising ${L\left(y^{h},\hat y\right)}$ or ${L\left(y^{l},\hat y\right)}$ produces the same overall effect, which is to bring ${w_{k}}$ close to ${X}$ when the image belongs to class ${k}$ and to take ${w_{i}}$, where ${i\neq k}$, far from ${X}$. The different performance of these two losses stems from the manner in which they go about doing this, which is explained below.

5. Derivatives Of Losses Tell The Difference.

Now let's painstakingly write the two losses without any abridgement.

${L\left(y^{h},\hat y\right)=\log\left(e^{w_{1}X^{T}}+e^{w_{2}X^{T}}+e^{w_{3}X^{T}}\right)-w_{2}X^{T}}$

${L\left(y^{l},\hat y\right)= \log\left(e^{w_{1}X^{T}}+e^{w_{2}X^{T}}+e^{w_{3}X^{T}}\right)-w_{2}X^{T}+\frac{\varepsilon}{3}\left(2w_{2}X^{T}-w_{1}X^{T}-w_{3}X^{T}\right)}$.

The reason we wrote the losses like this is that, written this way, it is easy to take their derivatives w.r.t. any term we want. We know that we minimise the loss using gradient descent. Imagine the loss surface as a convex surface (like the lower half of a hollow ball). Our aim is to get to the lowest point of this convex region, where the loss is lowest. We get to this point by continuously changing our parameters using the gradient descent rules. Now, at this point we need to remember some rules from calculus.

  1. Let's say there's a function ${f\left(x\right)}$. Its derivative w.r.t. x, ${\normalsize\frac{df}{dx}}$, can be denoted ${f'(x)}$. Notice that the derivative is also a function of x. Suppose ${f\left(x\right)}$ attains its minimum at the point ${x^{\ast}}$. Then ${f'(x^{\ast})=0}$.

  2. ${\normalsize\frac{d\log x}{dx}=\frac{1}{x}}$

Derivative of ${L\left(y^{h},\hat y\right)}$

Let's imagine that we trained our model using the loss ${L\left(y^{h},\hat y\right)}$ and through meticulous training we have reached the minimum point of our loss surface, i.e. our loss is the lowest it can be. (Sadly, in practice this doesn't happen, but we still assume it because by doing so we can infer how the parameters behave in order to reach the holy grail, i.e. the global minimum or a satisfactory local minimum.) The value of ${W}$ at the minimum is ${W^{\ast}=\left[ \begin{array}{c} w_{1}^{\ast} \\ w_{2}^{\ast} \\ w_{3}^{\ast} \end{array} \right]}$.

Now we take the derivative of ${L\left(y^{h},\hat y\right)}$ w.r.t. ${W}$. The derivative is written as ${L'_{h}\left(W\right)=\frac{\delta L_{h}}{\delta W}= \left[ \begin{array}{c} \frac{\delta L_{h}}{\delta w_{1}} \\ \frac{\delta L_{h}}{\delta w_{2}} \\ \frac{\delta L_{h}}{\delta w_{3}} \end{array} \right]}$ (${L_{h}}$ denotes ${L\left(y^{h},\hat y\right)}$). Since ${L_{h}}$ depends on two variables, ${X}$ and ${w_{i}}$, its derivative w.r.t. one of them is a partial derivative, written here with the delta (${\delta}$) sign. This sign simply denotes that while taking the derivative of a function w.r.t. one variable we treat the other variable as constant. Since we are taking derivatives w.r.t. ${w_{i}}$, we treat ${X}$ as constant.

  • ${\large\frac{\delta L_{h}}{\delta w_{1}}=\frac{e^{w_{1}X^{T}} X^{T}}{S}}$.
  • ${\large\frac{\delta L_{h}}{\delta w_{2}}=\frac{e^{w_{2}X^{T}} X^{T}}{S}-\normalsize X^{T}}$.
  • ${\large\frac{\delta L_{h}}{\delta w_{3}}=\frac{e^{w_{3}X^{T}} X^{T}}{S}}$.

Now from rules of calculus we know that ${\large\frac{\delta L_{h}}{\delta w_{1}}=\large\frac{\delta L_{h}}{\delta w_{2}}=\large\frac{\delta L_{h}}{\delta w_{3}}=\normalsize0}$ at ${W^{\ast}}$.

${\therefore\frac{e^{w_{1}^{\ast}X^{T}} X^{T}}{S}=0 \implies e^{w_{1}^{\ast}X^{T}}=0\implies w_{1}^{\ast}X^{T}=-\infty\,(Eq\,5.1)}$. Similarly, ${w_{3}^{\ast}X^{T}=-\infty\,(Eq\,5.2)}$. The case is different with ${w_{2}}$ though, because it is the template corresponding to the correct class: ${\frac{e^{w_{2}^{\ast}X^{T}} X^{T}}{S}-X^{T}=0\implies \frac{e^{w_{2}^{\ast}X^{T}}}{e^{w_{1}^{\ast}X^{T}}+e^{w_{2}^{\ast}X^{T}}+e^{w_{3}^{\ast}X^{T}}}=1}$ ${(Eq\,5.3)}$. ${(Eq\,5.3)}$ implies that ${e^{w_{1}^{\ast}X^{T}}}$ and ${e^{w_{3}^{\ast}X^{T}}}$ are negligible compared to ${e^{w_{2}^{\ast}X^{T}}}$. ${Eq\,5.1}$ and ${Eq\,5.2}$ show that ${L_{h}}$ is at its minimum when the distance between the templates of the incorrect labels (${w_{1}^{\ast}}$, ${w_{3}^{\ast}}$) and ${X}$ is ${\infty}$. From this we can infer the behaviour imposed on the weights connecting the penultimate layer and the final layer by reducing the loss ${L_{h}}$: minimising this loss takes the weights corresponding to the incorrect classes away from the penultimate layer's activations without any bound, i.e. ${X}$ and the templates of the incorrect classes really begin to hate each other and go as far away from each other as possible.

Derivative of ${L\left(y^{l},\hat y\right)}$

This time we train the model using the loss ${L\left(y^{l},\hat y\right)}$ and again reach the impractical situation where we are at the global minimum or a satisfactory local minimum of the loss surface. The ${W}$ at this point is ${W^{\star}}$. Note that this ${W^{\star}}$ is different from the ${W^{\ast}}$ of the previous subsection because our loss surface is different. (Apologies if the notation gets confusing: ${W}$ is a variable, while ${W^{\star}}$ is the fixed value of that variable at the minimum.)

The derivative of the loss w.r.t. ${W}$ is given as ${L'_{l}\left(W\right)=\frac{\delta L_{l}}{\delta W}= \left[ \begin{array}{c} \frac{\delta L_{l}}{\delta w_{1}} \\ \frac{\delta L_{l}}{\delta w_{2}} \\ \frac{\delta L_{l}}{\delta w_{3}} \end{array} \right]}$ (${L_{l}}$ denotes ${L\left(y^{l},\hat y\right)}$).

  • ${\large\frac{\delta L_{l}}{\delta w_{1}}=\frac{e^{w_{1}X^{T}} X^{T}}{S}-\frac{\varepsilon}{3}\normalsize X^{T}}$.
  • ${\large\frac{\delta L_{l}}{\delta w_{2}}=\frac{e^{w_{2}X^{T}} X^{T}}{S}-\normalsize X^{T}+\large \frac {2\varepsilon X^{T}}{3}}$.
  • ${\large\frac{\delta L_{l}}{\delta w_{3}}=\frac{e^{w_{3}X^{T}} X^{T}}{S}-\frac{\varepsilon}{3}\normalsize X^{T}}$.

We know that ${\large\frac{\delta L_{l}}{\delta w_{1}}=\large\frac{\delta L_{l}}{\delta w_{2}}=\large\frac{\delta L_{l}}{\delta w_{3}}=0}$ at ${W^{\star}}$.

${\therefore \frac{e^{w_{1}^{\star}X^{T}}X^{T}}{S}-\frac{\varepsilon X^{T}}{3}=0 \implies e^{w_{1}^{\star}X^{T}}=\frac{S\varepsilon}{3}\implies w_{1}^{\star}X^{T}=\log \frac{S\varepsilon}{3}}$ ${(Eq\,5.4)}$. Similarly, ${w_{3}^{\star}X^{T}=\log \frac{S\varepsilon}{3}}$ ${(Eq\,5.5)}$. In the case of ${\frac{\delta L_{l}}{\delta w_{2}}}$, ${\frac {e^{w_{2}^{\star}X^{T}}X^{T}}{S}-X^{T}+\frac {2\varepsilon X^{T}}{3} =0\implies e^{w_{2}^{\star}X^{T}}= S\left(1-\frac{2\varepsilon}{3}\right)}$ ${(Eq\,5.6)}$.

To interpret ${Eq\,5.6}$ let's put in a value of ${\varepsilon}$. Generally, ${\varepsilon}$ is taken to be 0.1. Putting that in ${Eq\,5.6}$, ${e^{w_{2}^{\star}X^{T}}= S\left(\frac{2.8}{3}\right)=0.93\left(e^{w_{1}^{\star}X^{T}}+e^{w_{2}^{\star}X^{T}}+e^{w_{3}^{\star}X^{T}}\right)\implies 0.07\,e^{w_{2}^{\star}X^{T}}\approx e^{w_{1}^{\star}X^{T}}+e^{w_{3}^{\star}X^{T}}}$. This shows that ${w_{2}^{\star}X^{T}}$ is still large compared to ${w_{1}^{\star}X^{T}}$ and ${w_{3}^{\star}X^{T}}$. But there is one important difference: ${Eq\,5.4}$ and ${Eq\,5.5}$ show that at the optimal point ${w_{1}^{\star}X^{T}}$ and ${w_{3}^{\star}X^{T}}$ are not ${-\infty}$ but a finite quantity, i.e. ${\log \frac{S\varepsilon}{3}}$. This shows that minimising ${L_{l}}$ doesn't decrease ${w_{1}^{\star}X^{T}}$ and ${w_{3}^{\star}X^{T}}$ without bound, but decreases them up to a certain point, which is the same for both. In this sense, ${X}$ ends up equidistant from ${w_{1}^{\star}}$ and ${w_{3}^{\star}}$.

Geometrically, we can say that minimising ${L_{l}}$ decreases the distance between the template of the correct class and the penultimate layer's activations (${X}$), and also encourages ${X}$ to move away from the templates of the incorrect classes while remaining equidistant from them. In this case ${X}$ hates the templates of the incorrect classes, but not as much as the ${X}$ of the previous subsection did. Also, it hates all the incorrect class templates equally and tries to remain equidistant from them. The ${X}$ and the template of the correct class in this section love each other, but not as strongly as those of the previous section. (Apologies for the cheesy interpretation.)
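
To see this behaviour concretely, here is a small sketch (made-up ${X}$, ${\varepsilon=0.1}$, plain SGD on the last-layer weights only) that trains ${W}$ with each loss. With one-hot targets, the incorrect logits keep drifting downwards and the softmax creeps towards ${[0,1,0]}$; with smoothed labels, the softmax settles near ${[\frac{\varepsilon}{3}, 1-\frac{2\varepsilon}{3}, \frac{\varepsilon}{3}]\approx[0.033, 0.933, 0.033]}$, which is exactly what Eq 5.4 to Eq 5.6 predict:

import torch
import torch.nn.functional as F

eps = 0.1
X = torch.tensor([1.0, -0.5, 2.0, 0.3])      # fixed penultimate activations (made-up)
targets = {'one-hot':  torch.tensor([0., 1., 0.]),
           'smoothed': torch.tensor([eps / 3, 1 - 2 * eps / 3, eps / 3])}

for name, target in targets.items():
    W = torch.zeros(3, 4, requires_grad=True)    # templates w1, w2, w3
    opt = torch.optim.SGD([W], lr=0.3)
    for _ in range(10000):
        opt.zero_grad()
        loss = -(target * F.log_softmax(W @ X, dim=0)).sum()
        loss.backward()
        opt.step()
    z = (W @ X).detach()
    # one-hot: the gap between the correct and incorrect logits keeps growing with more steps
    # smoothed: the softmax has converged to roughly [eps/3, 1 - 2*eps/3, eps/3]
    print(name, F.softmax(z, dim=0), z)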

So this is where ${L_{l}}$ is different from ${L_{h}}$.

I hope you can now make sense of the main statement of the paper by Rafael Müller et al., which I quote verbatim: "label smoothing encourages the activations of the penultimate layer to be close to the template of the correct class and equally distant to the templates of the incorrect classes."

6. Okay, So How Does It Help My Model?

(In this section ${X_{i}}$ will denote the penultimate layer's activations when an image belonging to class ${i}$ is input to the model.)

  • ${L_{h}}$ = Loss calculated with one-hot encoded target vectors
  • ${L_{l}}$ = Loss calculated with target vectors with smooth labels

Let's go with the above scenario: we have an image classification task where a given image can belong to one of 3 classes. Suppose that class1 and class2 are semantically very similar (e.g. the toy poodle and miniature poodle classes of ImageNet). This means that if you input an image belonging to class1 and another belonging to class2, their penultimate layer's activations can be very similar. Now, you prepare your dataset but unfortunately forget to shuffle it randomly, so all the images belonging to class1 are placed before all the images of class2 in the dataset. You begin training on this dataset with a suitable batch size, using the loss ${L_{h}}$. By the time a batch of images belonging to class2 goes into your model, the model has already been partially tuned by the class1 images. Since you are training with the ${L_{h}}$ loss, the ${X_{1}}$'s which derive from images of class1 have been dragged extremely far away from the templates of class2 and class3. Now a batch of images belonging to class2 goes into the model. Since class2 is semantically similar to class1, images belonging to this class have penultimate activations ${X_{2}}$ very similar to ${X_{1}}$. Because of this, these images will show a very strong affinity for class1 and resist being predicted as class2, because ${w_{1}X_{2}^{T}}$ will be high for these images while ${w_{2}X_{2}^{T}}$ will be low. This will incur a large loss value, which is bad for the model. To remedy this, the model will need to take large steps and will take a longer time to reduce the huge loss. Instead, if we had trained with ${L_{l}}$, our model wouldn't have to work as hard to adapt, since the model itself is not entirely confident that penultimate activations similar to ${X_{1}}$ imply the label class1. Since ${w_{2}}$ is not dragged too far away from ${X_{1}}$, and consequently from ${X_{2}}$, the loss wouldn't be as high as in the previous case and the model will be able to adapt quickly. (Maybe this example also shows the importance of randomly shuffling your data.)

Another advantage shows up at classification time. Suppose that this time, learning from our previous mistake, we shuffled the data randomly but still trained the model with the loss ${L_{h}}$. Since images belonging to class1 and class2 are very similar, both ${X_{1}}$ and ${X_{2}}$ are close to both ${w_{1}}$ and ${w_{2}}$, and both are far away from ${w_{3}}$. Although this model will accurately differentiate between class1 and class3, or class2 and class3, it may sometimes misclassify an image belonging to class1 or class2. Because their ${X}$'s are so similar, the model may assign class2 to an image belonging to class1 or vice versa. If instead we train with the ${L_{l}}$ loss, ${X_{1}}$ will be equidistant from ${w_{2}}$ and ${w_{3}}$, and similarly ${X_{2}}$ will be equidistant from ${w_{1}}$ and ${w_{3}}$. Then, if we feed in an image belonging to class2 and it produces ${X_{2}}$ at its penultimate layer, the model will correctly assign it class2 and will not confuse it with class1 (provided the distance of ${X_{2}}$ from ${w_{1}}$ and ${w_{3}}$ is sufficiently large).

Sometimes, during the data labelling process, some images may get incorrect labels due to human error or other factors. In that case you don't want the penultimate layer activations of your images to cling too tightly to the template of the incorrectly labelled class, which would inevitably happen if you use the ${L_{h}}$ loss. To make the model robust against these incorrect labels, label smoothing can come in handy, because it decreases the model's confidence in its (possibly incorrect) ground-truth labels and doesn't let the ${X}$'s of the images get too close to the templates of their incorrect labels. Even though ${X}$ will still get close to the template of its incorrect label, it will be easier to correct if the model is trained with ${L_{l}}$ instead of ${L_{h}}$.

7. Conclusion

We may conclude that if our dataset has semantically different classes and is correctly labelled (e.g. the Imagenette dataset by fastai), then our normal loss function may perform well. But if it has semantically similar classes (e.g. the Imagewoof dataset by fastai) or has incorrect labels, then you may want to use label smoothing. (Also, don't forget to randomly shuffle your data ;).)

If you notice a mistake in this blog post, please mention it in the comment section or email me at iamabhimanyu08@gmail.com, and I'll make sure to correct it right away.