Having recently looked at the Single-Shot Detector (SSD) on the PASCAL-VOC dataset, I try it on a different dataset to see what happens.
You can find the notebook I used for this post here.
The dataset
The Street View House Numbers (SVHN) is a dataset of images of house number plates and the location of the numbers that appear on them. The locations of the numbers are described by bounding boxes. This image from their website shows what the dataset looks like:
In PASCAL-VOC the objects that appear in the images are things like cars, people, chairs, TVs, etc., whereas in SVHN they are numbers, or more precisely digits: 0, 1, 2, …, 9. Can the SSD pick these out?
Loading the annotation
The annotation for the dataset comes in a .mat file, digitStruct.mat, that has a rather complicated structure. The \( (x, y) \) coordinates of the top-left and bottom-right corners of the bounding boxes can be extracted this way:
import h5py
import numpy as np

def get_bbox_label():
    '''
    Loads SVHN's train/digitStruct.mat file that includes the annotations.
    Yields bounding boxes described as (y, x, y + dy - 1, x + dx - 1).
    '''
    # PATH points at the SVHN data directory, defined earlier in the notebook.
    f = h5py.File(PATH/'train/digitStruct.mat', 'r')
    for i in range(f['digitStruct/name'].shape[0]):
        # File names are stored as arrays of character codes.
        fname = ''.join(chr(v) for v in f[f['digitStruct/name'][i,0]][()].flatten())
        grp = f[f['digitStruct/bbox'][i,0]]
        # Each field is a plain float when the image has a single digit,
        # and an array of HDF5 object references otherwise.
        tops = [v if isinstance(v, np.float64) else f[v][()][0,0] for v in grp['top'   ][:,0]]
        hgts = [v if isinstance(v, np.float64) else f[v][()][0,0] for v in grp['height'][:,0]]
        lefs = [v if isinstance(v, np.float64) else f[v][()][0,0] for v in grp['left'  ][:,0]]
        wths = [v if isinstance(v, np.float64) else f[v][()][0,0] for v in grp['width' ][:,0]]
        lbls = [v if isinstance(v, np.float64) else f[v][()][0,0] for v in grp['label' ][:,0]]
        bbs = [np.array([y, x, y + dy - 1, x + dx - 1]).astype(np.int64)
               for y, dy, x, dx in zip(tops, hgts, lefs, wths)]
        annot = [(b, int(l)) for b, l in zip(bbs, lbls)]
        yield {'name': fname, 'annot': annot}
For example, a bounding box whose top-left corner is at \((x=5, y=6)\) and bottom-right corner at \((x=15, y=26)\) is described by np.array([6, 5, 26, 15]) in the above function.
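As a quick check, we can pull the first record from the generator and print the boxes and labels it contains (the exact values will depend on the dataset files):

gen = get_bbox_label()
rec = next(gen)
print(rec['name'])                # file name of the first training image
for bbox, label in rec['annot']:
    print(bbox, label)            # each (y1, x1, y2, x2) box with its digit label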
Vanishing bounding boxes
One thing that can happen is that a ground-truth bounding box goes through data transforms like resizing and rotation and disappears, becoming np.array([0, 0, 0, 0]). It can be that the box is already very small, so when the image is resized to a smaller size, the box simply vanishes. A box that starts out very close to the edge of the image can also disappear after the image is rotated. In such cases the loss is not computed, because it does not make sense.1
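As a toy illustration (this is not the actual transform code), integer rounding during a resize can collapse a small box to nothing:

import numpy as np

def resize_box(box, orig_sz, new_sz):
    '''Scale a (y1, x1, y2, x2) box from an orig_sz image to a new_sz image.'''
    return (np.array(box) * new_sz / orig_sz).astype(np.int64)

# A 3-pixel-tall, 3-pixel-wide box in a 300-pixel image...
print(resize_box([10, 100, 12, 102], orig_sz=300, new_sz=64))
# -> [ 2 21  2 21]: zero height and width after resizing to 64 pixels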
Looking for smaller objects
By default, the SSD used here effectively applies a \(4\times 4\), \(2 \times 2\) and \(1 \times 1\) grid over the image, which means it can pick out objects as small as roughly \(1/16\) the size of the image. However, some of the digits on the number plates are really quite small, so I also take output from the fourth-last layer, in addition to the last three layers. The head of the SSD model then becomes:
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSD_MultiHead(nn.Module):
    def __init__(self, k, bias):
        super().__init__()
        # drop, StdConv and OutConv are defined earlier in the notebook.
        self.drop = nn.Dropout(drop)
        self.sconv0 = StdConv(512, 256, stride=1, drop=drop)
        self.sconv1 = StdConv(256, 256, drop=drop)
        self.sconv2 = StdConv(256, 256, drop=drop)
        self.sconv3 = StdConv(256, 256, drop=drop)
        self.out0 = OutConv(k, 256, bias)
        self.out1 = OutConv(k, 256, bias)
        self.out2 = OutConv(k, 256, bias)
        self.out3 = OutConv(k, 256, bias)

    def forward(self, x):
        x = self.drop(F.relu(x))
        x = self.sconv0(x)                 # fourth-last layer: the finest grid
        o0c, o0l = self.out0(x)
        x = self.sconv1(x)                 # each stride-2 conv halves the grid
        o1c, o1l = self.out1(x)
        x = self.sconv2(x)
        o2c, o2l = self.out2(x)
        x = self.sconv3(x)
        o3c, o3l = self.out3(x)
        # Concatenate the class and localisation outputs from all four scales.
        return [torch.cat([o0c, o1c, o2c, o3c], dim=1),
                torch.cat([o0l, o1l, o2l, o3l], dim=1)]
Here, the number in a variable name indicates which layer it comes from: sconv0 for the fourth-last layer, sconv1 for the third-last layer, and so on. Working backwards, the fourth-last layer provides a \( 7 \times 7 \) grid over the image, so in principle it can spot objects as small as \(1/49\) the size of the image.
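As a quick sanity check (a sketch that assumes the StdConv and OutConv helpers from the earlier post, a 512-channel \( 7 \times 7 \) backbone feature map, and illustrative values for k and bias), we can push a dummy feature map through the head and count the anchors coming out of the four scales, \(7^2 + 4^2 + 2^2 + 1^2 = 70\) grid cells in total:

head = SSD_MultiHead(k=9, bias=-4.)   # k anchors per grid cell; values assumed
dummy = torch.randn(1, 512, 7, 7)     # stand-in for the backbone output
clas, bbox = head(dummy)
print(clas.shape, bbox.shape)         # dim 1 should hold 70 * k = 630 anchors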
Examples of predicted boxes
Here are the predicted bounding boxes on a selection of validation samples:
It looks like the SSD is detecting numbers, at least to some extent. Most digits that appear on house number plates are tall and thin and sit very close to each other; you hardly ever see flat, pancake-like digits. To reflect this, we could adjust the aspect ratios of the anchor boxes so that they are narrower, which might improve the results.
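For instance, in a fastai-style anchor setup the ratios are (height, width) multipliers, so narrower anchors could be sketched like this (the exact values are assumptions, not what the notebook uses):

anc_zooms = [0.7, 1., 1.3]
anc_ratios = [(1., 1.), (2., 0.5), (3., 1/3.)]   # favour tall, narrow boxes
anchor_shapes = [(z * h, z * w) for z in anc_zooms for (h, w) in anc_ratios]
k = len(anchor_shapes)   # anchors per grid cell, passed to SSD_MultiHead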
1. There is also a technical reason. An image can have more than one ground-truth bounding box. If some of these are np.array([0, 0, 0, 0]) boxes, get_y actually filters them out. The problem is when all the boxes are filtered out, leaving [], which cannot be handled by some of the functions further down in ssd_1_loss, leading to a runtime error.
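As a minimal sketch of that filtering (not the notebook's exact code; it assumes bbox is an (n, 4) tensor of (y1, x1, y2, x2) boxes and clas an (n,) tensor of labels):

import torch

def get_y(bbox, clas):
    '''Drop ground-truth boxes that collapsed to zero area during transforms.'''
    keep = ((bbox[:, 2] - bbox[:, 0]) > 0) & ((bbox[:, 3] - bbox[:, 1]) > 0)
    return bbox[keep], clas[keep]

bbox = torch.tensor([[0, 0, 0, 0], [2, 21, 5, 24]])
clas = torch.tensor([1, 7])
bbox, clas = get_y(bbox, clas)   # the vanished box is dropped
if len(bbox) == 0:               # if nothing survives, skip the loss
    print('skip this image')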