Grad-CAM in Deepfake Face Recognition

Where's the CNN looking?

Intro

Class activation mapping (CAM) can be thought of as an interpretation tool for convolutional neural networks. It visually explains the prediction made by the neural network by highlighting certain areas in the input image. These areas are said to have been looked at most 'closely' by the neural network in making its prediction.

In object recognition, if a model predicts 'dog' from an input image containing a person walking his dog, you expect the CAM to highlight the area centred around the dog. If the model instead predicts the caption 'Man walking dog' from the same image, it would make sense for both the person and the dog to be highlighted.

Deepfakes

Deepfakes are realistic-looking images or videos in which a person's face has been artificially manipulated. There are various techniques for generating deepfakes, but in general the result is that the original face is subtly altered. What is it, then, that a neural network 'looks at' to decide that an image of a face is a deepfake? We apply a pretrained deepfake classifier to several deepfake videos and construct CAMs to see if they give an indication of what the classifier looks at in an image to tell that it's a deepfake. Along the way, it's shown how Pytorch hooks are used to extract information from inside a model.

FAKE/REAL classifier

The model that will be used here is an EfficientNet-b1 model that has already been trained to classify an image of a face as either FAKE or REAL. It's not the best-performing model out there; it forms the basis of a solution with a logloss score of about 0.51980 on the private leaderboard of Kaggle's Deepfake Detection Challenge. (The 1st-place solution's score is 0.42320.)
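The trained model itself isn't reproduced in this post, but to give an idea of what such a classifier looks like, a two-class EfficientNet-b1 can be built along these lines (a sketch using the timm library purely as an illustration; the actual competition model may well have been constructed differently):

import timm
import torch

# Illustrative only: an EfficientNet-b1 backbone with a 2-way head (FAKE/REAL)
model = timm.create_model('efficientnet_b1', pretrained=True, num_classes=2)
model.eval()

with torch.no_grad():
    logits = model(torch.randn(1, 3, 240, 240))   # b1's native input resolution is 240x240
print(logits.shape)   # torch.Size([1, 2])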

Face detector

The classifier works on images of faces, so face crops need to be made from the videos to be examined. For this, Pytorch RetinaFace is used to detect faces in the video frames. Despite being relatively slow, it's a very accurate face detector.

Data

The Kaggle Deepfake Detection Challenge provides a good set of deepfake videos to work with. Each video is about 10 seconds long, and, usefully, the metadata records which other video is the original version of each deepfake, which makes frame-by-frame comparison possible. Here, several deepfake videos from the train_example_videos set of this dataset are selected, and their CAMs will be looked at.
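As a rough sketch of how the deepfake/original pairing can be read off the metadata (the 'label' and 'original' keys are as found in the competition's metadata.json; the directory path here is an assumption, so adjust it to your layout):

import json
from pathlib import Path

data_dir = Path('train_example_videos')            # adjust to wherever the videos live
metadata = json.loads((data_dir / 'metadata.json').read_text())

# Each FAKE entry records which video it was generated from
pairs = [(data_dir / fake, data_dir / info['original'])
         for fake, info in metadata.items()
         if info['label'] == 'FAKE']
print(f'{len(pairs)} deepfake/original pairs found')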

Class Activation Mapping

A CAM is supposed to give an indication of what the model regards as important in the image when making a certain prediction, or, more precisely in the current context, what it regards as important in predicting the score for an image being FAKE. As such, there are two main ingredients to consider in constructing a CAM:

  1. The output activation of a convolutional layer

  2. The output score for a selected class (FAKE in the current example).

That the CAM should be constructed from the output activation of a convolutional layer in the model can perhaps be justified by the fact that such activations have spatial correspondence with the input image, and that activation in general reflects the amount of excitation, and hence 'importance', due to certain pixels (the receptive field) in the image. However, the output activation alone isn't enough, because it's computed from the input image through the forward propagation; nothing about it has come from the score of a predicted class of interest. To incorporate this, the gradient of the score with respect to the output activation is used, too.

Here is how the output activation and the gradient of the output score with respect to it are combined to form the CAM. Suppose \(A_{ij}^k\) is the output activation of one of the model's convolutional layers, where \(i\) and \(j\) are the indices for the spatial (width and height) dimensions, and \(k\) is the index for the channel dimension. And let \(y^{c}\) be the model's output score for the image being FAKE (this need not be a probability; it could be a score just before a softmax). First, the gradient of \(y^c\) with respect to \(A^k_{ij}\) is averaged over the spatial dimensions (equivalent to applying global average pooling):

$$ \alpha^{c}_{k} = \frac{1}{Z}\sum_{i}\sum_{j}\frac{\partial{y^c}}{\partial{A^k_{ij}}} $$

Since \(\frac{\partial{y^c}}{\partial{A^k_{ij}}}\) measures how fast \(y^c\) changes with \(A^k_{ij}\), \(\alpha^{c}_{k}\) is a measure of the importance of channel \(k\) for output class \(c\), and so it makes sense to use \(\alpha\) as the weights in a weighted sum that reduces the channel dimension of \(A^k_{ij}\). This is essentially the CAM:

$$ L^c_{ij} = ReLU\left(\sum_{k} \alpha^c_k \: A^k_{ij}\right) $$

Notice that a \(ReLU\) is applied to the weighted sum over channels. In the literature, technically, the above is called gradient-weighted class activation mapping (Grad-CAM). In the special case where \(A\) is from the last convolutional layer, after which there is only one linear layer in the model, it's just called CAM. In principle, the above expression can be applied to any convolutional layer in the neural network to get the Grad-CAM.
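To make the recipe concrete, here's a tiny numeric sketch, with made-up tensors standing in for a real activation and gradient, of how \(\alpha^c_k\) and \(L^c_{ij}\) are computed:

import torch

# Made-up stand-ins: a 4-channel, 7x7 activation A and the gradient of the FAKE score w.r.t. it
A = torch.rand(4, 7, 7)
dyc_dA = torch.rand(4, 7, 7)

alpha = dyc_dA.mean(dim=(1, 2))                           # global average pool over i, j: one weight per channel k
cam = torch.relu((alpha[:, None, None] * A).sum(dim=0))   # weighted sum over channels, then ReLU

print(cam.shape)  # torch.Size([7, 7]): one importance value per spatial location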

Hooks

\(A^{k}_{ij}\), an output activation, is computed during the forward propagation, and \(\frac{\partial{y^c}}{\partial{A^{k}_{ij}}}\), a gradient, is computed during the backward propagation. The way to get at these quantities in Pytorch is to use hooks. Once a layer whose output activation is to be considered for the CAM is chosen, two hooks are attached to it: one to get \(A^{k}_{ij}\) during forward propagation, and the other to get \(\frac{\partial{y^c}}{\partial{A^{k}_{ij}}}\) during backward propagation.

Below is an implementation that allows the creation of such hooks and the storing of the activation and the gradient.

class FwdHook():
    "Capture a module's output activation during the forward pass."
    def __init__(self, m):
        self.hook_handle = m.register_forward_hook(self.hook_func)
    def hook_func(self, m, i, o): self.stored = o.detach().clone()
    def __enter__(self, *args): return self
    def __exit__(self, *args): self.hook_handle.remove()

class BwdHook():
    "Capture the gradient on a module's output during the backward pass."
    def __init__(self, m):
        self.hook_handle = m.register_backward_hook(self.hook_func)
    def hook_func(self, m, ig, og): self.stored = og[0].detach().clone()
    def __enter__(self, *args): return self
    def __exit__(self, *args): self.hook_handle.remove()

This implementation is taken from the fastbook by Jeremy Howard and Sylvain Gugger. FwdHook is used for forward propagation, and BwdHook is used for backward propagation. Central to these are the register_forward_hook and register_backward_hook methods of nn.Module objects. These allow \(A_{ij}^k\) and \(\frac{\partial{y^c}}{\partial{A_{ij}^k}}\) to be accessed during forward and backward propagation, respectively. FwdHook and BwdHook also allow the accessed activation and gradient to be stored in them (as the stored attribute). They're also context managers (as defined by their __enter__ and __exit__ methods), so they will safely disconnect the hooks once the needed quantities have been stored away. FwdHook and BwdHook are all but identical, except that one registers a forward hook and the other a backward hook. One other minor difference is that the gradient on the output activation, og, is a tuple, from which the first element is taken.
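As a quick illustration of how the two classes are used together (a toy sketch, with a standalone convolutional layer standing in for a layer of the real model):

import torch
import torch.nn as nn

toy_layer = nn.Conv2d(3, 8, kernel_size=3, padding=1)   # stand-in for a layer of the real model
x = torch.randn(1, 3, 16, 16)

with BwdHook(toy_layer) as bwdhook:
    with FwdHook(toy_layer) as fwdhook:
        out = toy_layer(x)          # forward pass: fwdhook.stored now holds the output activation
    out.mean().backward()           # backward pass: bwdhook.stored now holds the gradient on that output

print(fwdhook.stored.shape, bwdhook.stored.shape)   # both torch.Size([1, 8, 16, 16])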

So now, we can attach the hooks to a selected layer of the model, pass an image through the model (forward propagation), and backpropagate from the output score for the FAKE class. These steps are summed up in the following function:

def classify_face(im, cam_module):
    # Normalise the face crop and add a batch dimension
    xb = im[None,...] / 255
    xb = xb.sub_(mean_norm[None,:,None,None]).div_(std_norm[None,:,None,None])
    xb = xb.to(device)
    model.eval()
    cls = 0  # index of the FAKE class
    with BwdHook(cam_module) as bwdhook:
        with FwdHook(cam_module) as fwdhook:
            output = model(xb)
            act = fwdhook.stored[0]   # A: output activation of cam_module
        output[0,cls].backward()
        grad = bwdhook.stored[0]      # gradient of the FAKE score w.r.t. A

    # Weighted sum over the channels of A, with weights from the spatially averaged gradient
    gradcam = (grad.mean(dim=(1,2))[:,None,None] * act).sum(dim=0).relu_()
    with torch.no_grad(): prob = F.softmax(output[0], dim=-1)[0]
    return prob, gradcam

im is the tensor of the image of interest. cam_module is the layer (or module) of interest. The forward propagation's hook is nested inside the backward propagation's because forward propagation happens first: the output activation is taken from it, and the forward hook removed, before the backward propagation is done, the gradient taken, and the backward hook removed. The backward propagation is done by calling the backward method of the output score for the FAKE class, \(y^{0}\) (here \(0\) is for FAKE, and \(1\) is for REAL). It's important that the thing on which backward is called is a scalar.

As normally happens during inference, a single image is processed at a time, and the model is set to eval mode. A difference, however, is that things are not done under the torch.no_grad() context here, because the gradient is needed.

After the needed quantities, \(A_{ij}^k\) and \(\frac{\partial{y^c}}{\partial{A_{ij}^k}}\), are obtained and the hooks removed, the Grad-CAM is put together, following the description above: gradcam = (grad.mean(dim=(1,2))[:,None,None] * act).sum(dim=0).relu_()
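For completeness, here's one way (a sketch, not necessarily how the figures below were produced) to upsample the Grad-CAM to the size of the face crop and overlay it with the magma colour map:

import torch.nn.functional as F
import matplotlib.pyplot as plt

def show_cam(im, gradcam, alpha=0.5):
    # im: the face crop as an HxWx3 array; gradcam: the 2D map returned by classify_face
    h, w = im.shape[0], im.shape[1]
    cam = F.interpolate(gradcam[None, None, ...], size=(h, w),
                        mode='bilinear', align_corners=False)[0, 0]
    plt.imshow(im)
    plt.imshow(cam.cpu(), cmap='magma', alpha=alpha)   # semi-transparent overlay
    plt.axis('off')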

Example videos

Now let's look at the CAMs of several example videos. In each example, an original video (without deepfake manipulation) and its deepfake version are used. Five frames are randomly selected, at the same positions in both videos so that they align, and the face is cropped from each frame. The top row in each figure shows the original faces. The second row shows the corresponding deepfake faces, and the third row shows the deepfake faces with the CAM overlaid. The CAM is plotted here with the magma colour map, so the more yellowish an area is, the higher its value. In the top-right corner of the deepfake faces, the number shown is the probability with which the model thinks the face is a deepfake.

/dfdc_cam/seated_woman_0.png
FIGURE 1. (Very badly faked glasses.) The middle part of the glasses appears to be missing in all the deepfake images. Overall, though, the deepfaked faces appear quite realistic, without any other obvious defects. The area near the eyes is highlighted a lot more than the rest, perhaps due to the unrealistic glasses.
/dfdc_cam/seated_woman_1.png
FIGURE 2. (Slightly better faked glasses.) This example is based on the same original video as above. The glasses in the deepfake images appear to be better rendered. The highlighted area appears slightly more spread out and less concentrated near the eyes, especially in frames 2 and 3, though the predicted probability is lower at the same time.
/dfdc_cam/seated_woman_2.png
FIGURE 3. (Fake moustache.) The deepfaked face has a moustache on it, which is obviously fake, but it doesn't appear to be highlighted much. The eyes/glasses are still the most highlighted bits of the image.
/dfdc_cam/man_onsofa_0.png
FIGURE 4. (Man on sofa.) The man is not wearing glasses like in the preceding examples. The highlighted area is visibly more spread out over the face.
/dfdc_cam/woman_ongreen_0.png
FIGURE 5. (Woman against green backdrop 0.) The quality of the deepfake faces in the video appears to be very good. They're realistic and have no obvious defects. The highlighted areas are also more spread out here, not just near the eyes, though less visible when the head is tilted or when the woman smiles.
/dfdc_cam/woman_ongreen_1.png
FIGURE 6. (Woman against green backdrop 1.) This video has dim lighting. The deepfake faces not only appear realistic, they are also close to the originals. It's clear that the model is less confident here. Nevertheless, one can still see some faint highlights.

In the final two examples, the CAMs plotted don't have the ReLU applied, so they appear brighter in general, but one should still be able to make out where the highlights are. Also, only the deepfake images with the CAM overlaid are shown.

/dfdc_cam/frame_twomen_tshirt.png
FIGURE 7. (Video frame: Two men, one t-shirt.)
/dfdc_cam/twomen_tshirt_0.png
FIGURE 8. (Two men, one t-shirt.) Notice that when the model is most confident, it's actually focussing on an area that falls just outside the person's face. In general, the CAM appears darker (smaller in value) on the faces which are unlikely (low probability) to be deepfakes.
/dfdc_cam/frame_woman_portraits.png
FIGURE 9. (Video frame: Woman in front of two portrait paintings.)
/dfdc_cam/woman_portraits_0.png
FIGURE 10. (Woman in front of two portrait paintings.) Here, only frame 5 is likely to be a deepfake, and there the highlight is near the eyes. In contrast, in the other frames, which are less likely to be deepfakes, the highlights appear to be around the face.
