Citation for the published version: Sáez Trigueros, D., Meng, L., & Hartnett, M. (2018). Enhancing Convolutional Neural Networks for Face Recognition with Occlusion Maps and Batch Triplet Loss. Image and Vision Computing, 79, 99. DOI: 10.1016/j.imavis.2018.09.011

Document Version: Accepted Version. This manuscript is made available under the CC-BY-NC-ND license: https://creativecommons.org/licenses/by-nc-nd/4.0/

Link to the final published version available at the publisher: https://doi.org/10.1016/j.imavis.2018.09.011

Enhancing Convolutional Neural Networks for Face Recognition with Occlusion Maps and Batch Triplet Loss

Daniel Sáez Trigueros (a, b), Li Meng (a, *), Margaret Hartnett (b)

(a) School of Engineering and Technology, University of Hertfordshire, Hatfield AL10 9AB, UK
(b) GBG plc, London E14 9QD, UK

(*) Corresponding author. Email address: l.1.meng@herts.ac.uk (Li Meng)

Abstract

Despite the recent success of convolutional neural networks for computer vision applications, unconstrained face recognition remains a challenge. In this work, we make two contributions to the field. Firstly, we consider the problem of face recognition with partial occlusions and show how current approaches might suffer significant performance degradation when dealing with this kind of face images. We propose a simple method to find out which parts of the human face are more important to achieve a high recognition rate, and use that information during training to force a convolutional neural network to learn discriminative features from all the face regions more equally, including those that typical approaches tend to pay less attention to. We test the accuracy of the proposed method when dealing with real-life occlusions using the AR face database. Secondly, we propose a novel loss function called batch triplet loss that improves the performance of the triplet loss by adding an extra term to the loss function that minimises the standard deviation of both positive and negative scores. We show consistent improvement on the Labeled Faces in the Wild (LFW) benchmark by applying both proposed adjustments to the convolutional neural network training.

Keywords: face recognition, convolutional neural networks, facial occlusions, distance metric learning

1. Introduction

Deep learning models, in particular convolutional neural networks (CNNs), have revolutionised many computer vision applications, including face recognition.
As recent benchmarks [1, 2, 3] show, most of the top performing face recognition algorithms are based on CNNs. Even though these models need to be trained with hundreds of thousands of faces to achieve state-of-the-art accuracy, several large-scale face datasets [4, 5, 3] have recently been made publicly available to facilitate this.

Most of the recent research in the field has focused on unconstrained face recognition. CNN models have shown excellent performance on this task, as they are able to extract features that are robust to the variations present in the training data (if enough samples containing these variations are provided). Nonetheless, in this work, we show how partial facial occlusions remain a problem for unconstrained face recognition. This is because most databases used for training do not contain enough occluded faces for a CNN to learn how to deal with them. Common sources of occlusion include sunglasses, hats, scarves, hair, or any object between the face and the camera. This is of particular relevance to applications where the subjects are not expected to be co-operative (e.g. in security applications). One way of overcoming this problem is to train CNN models with datasets that contain more occluded faces. However, this task can be challenging because the main source of face images is usually the web, where labelled faces with occlusions are less abundant.

Bearing this in mind, we propose a novel data augmentation approach for generating occluded face images in a strategic manner. We use a technique similar to the occlusion sensitivity experiment proposed in [6] to identify the face regions from which a CNN extracts the most discriminative features. In our proposed method, the identified face regions are covered during training to force a CNN to extract discriminative features from the non-occluded face regions, with the goal of reducing the model's reliance on the identified face regions. Our CNN models trained using this approach have demonstrated a noticeable performance improvement on face images presenting real-life facial occlusions in the AR face database [7].

CNN models for face recognition can be trained using different approaches. One of them consists of treating the problem as a classification one, wherein each identity in the training set corresponds to a class. After training, the model can be used to recognise faces that are not present in the training set by discarding the classification layer and using the features of the previous layer as the face representation. In the realm of deep learning, these features are commonly referred to as bottleneck features. Following this first training stage, the model can be further trained using other techniques to optimise the bottleneck features for the target application [4, 8]. Another common approach to learning a face representation is to directly learn bottleneck features by optimising a distance metric between pairs of faces [9, 10] or triplets of faces [11].

Positive results have been demonstrated when combining these two techniques, either by (i) jointly training with a classification loss and a distance metric loss [12]; or (ii) first training with a classification loss and then fine-tuning the CNN model with a distance metric loss [5, 13, 14]. In this work, we adopt the latter approach and use the triplet loss to optimise bottleneck features.
The goal of the triplet loss is to separate positive scores (obtained when comparing pairs of faces belonging to the same subject) from negative scores (obtained when comparing pairs of faces belonging to different subjects) by a minimum margin. We argue that training with this loss function can lead to undesired results. Thus, we propose a new loss function that alleviates this issue by also minimising the standard deviation of both positive and negative scores. Using the Labeled Faces in the Wild (LFW) benchmark, we show that the CNN models trained with our proposed loss function consistently outperform those trained with the triplet loss function.

The remainder of this paper is organised as follows. Section 2 provides a review of the related work, with a focus on deep learning approaches and face recognition with occlusion. Section 3 details our CNN architecture and training procedure, including (i) our method for improving the recognition of partially occluded faces and (ii) our novel loss function. Section 4 describes our experimental results, and our conclusions are presented in Section 5.

2. Related Work

One of the first successful applications of convolutional neural networks was handwritten character recognition [15]. Soon after, the first face recognition algorithm that included a CNN was proposed in [16]. However, unlike [15], their algorithm [16] was not entirely based on neural networks. Years later, [9] proposed an end-to-end Siamese architecture trained with a contrastive loss function to directly minimise the distance between pairs of faces from the same subject while increasing the distance between pairs of faces from different subjects. These CNN-based face recognition models did not achieve groundbreaking results, mainly due to the low capacity of the networks used and the relatively small datasets available for training at the time. It was not until these models were scaled up and trained with large amounts of data [17] that CNNs became the state-of-the-art approach for face recognition.

In particular, Facebook's DeepFace [18], one of the first CNN-based approaches for face recognition that used a high capacity model, achieved an accuracy of 97.35% on the LFW benchmark, reducing the error of the previous state of the art by 27%. DeepFace used an effective 3D alignment algorithm to frontalise faces before feeding them to a CNN with several convolutional, max-pooling and locally connected layers. The CNN was trained with a dataset containing 4.4 million faces from 4,030 subjects. Concurrently, the DeepID system [8] achieved similar results by concatenating the bottleneck features of 60 CNNs trained on different face crops and optimising the concatenated feature vector using the Joint Bayesian method proposed in [19]. More work by the same authors [12] achieved further performance improvements by simultaneously training with a contrastive loss (similar to the one used in [9]) and a classification loss. The authors claimed that the contrastive loss reduced intra-personal variations and the classification loss increased inter-personal variations. The described system achieved an accuracy of 99.15% on the LFW benchmark using a relatively small training set containing 202,599 face images of 10,177 identities.

As shown in [20], training data is one of the most important factors for increasing the accuracy of CNN-based approaches.
In particular, it was shown that a CNN model becomes more accurate as the number of different identities in the training set increases, provided that several samples per identity are available. A good example is Google's FaceNet [11], which used between 100 million and 200 million face images of about 8 million different people for training. A triplet loss function with a novel online triplet sampling strategy was used for training FaceNet, which achieved an accuracy of 99.63% on the LFW benchmark. The triplet loss has subsequently been used to fine-tune CNNs pre-trained with a classification loss, with good results [5, 21]. Indeed, the triplet loss has become one of the most popular training objectives for face verification [11, 5, 21, 13, 14], and has been used in other image similarity tasks such as ranking images [22, 23, 24] and learning local image descriptors [25, 26]. Other popular tricks to improve the performance of CNN-based face recognition include the Joint Bayesian method [4, 9, 19, 27] and building ensemble models trained on different face crops [9, 19, 21, 27].

Recognition of faces with occlusions has typically been handled using two different types of methods, namely, (i) methods that extract local features from the non-occluded regions, and (ii) methods that attempt to reconstruct occluded regions.

In the first type of methods, occluded regions are detected first and discarded from the set of local regions used to represent a face. For example, Gabor wavelet features, PCA and SVM were used in [28] to detect the occluded regions, and LBP descriptors were used to match the non-occluded regions. In [29], eigen decomposition was used to generate a reformed image which was subtracted from the original occluded image to locate the occluded regions; Gabor wavelet features and PCA were then used to extract features from the non-occluded regions. The method in [30] proposed to extract histograms of Gabor-LBP features on the entire image and then use SIFT keypoint matching to select which subregions should be taken into consideration.

Among the methods that attempt to reconstruct occluded regions, the sparse representation-based classification (SRC) proposed in [31] has received a lot of attention. This method attempts to represent an occluded test image by a linear combination of training images of the same class and an error term that accounts for the occluded region. The class that gives the closest reconstruction of the original image is considered the correct one. Several improvements to this method have been proposed. For example, [32] extended SRC by using a Markov random field to model the prior assumption about the spatial continuity of the occluded regions. In [33] it was proposed to weight each pixel in the image independently to achieve better reconstructed images. Another improvement [34] proposed to use linear combinations of Gabor wavelet features instead of pixel intensities, which increased the discriminative power of the face representation and reduced computational costs. The drawback of these methods is that the reconstruction can only be achieved for images of the same class as the training images.

Another method that has gained popularity in image reconstruction tasks such as image denoising and image inpainting is the denoising autoencoder [35, 36]. The idea is to train a model to learn a mapping between corrupted and clean images. Several approaches have used this idea to reconstruct occluded face images.
For example, a stacked sparse denoising autoencoder [36] with two channels was proposed in [37] to discard noise activations in the encoder network and achieve better image reconstructions. Another related method was proposed in [38], which used a novel mapping-autoencoder for occlusion detection and an iterative stacked denoising autoencoder for image reconstruction. More recently, [39] proposed to use LSTM autoencoders with two channels to reconstruct faces in the wild. In this method, one autoencoder channel reconstructs the image and the other detects an occlusion mask that is used to replace the occluded region in the original image with the reconstructed pixels. The quality of the final output was further enhanced by introducing an adversarial discriminator.

3. Proposed Methods

We use the CNN architecture proposed in [4], which has demonstrated the ability to achieve high accuracy on the LFW benchmark while maintaining low computational complexity. This CNN architecture is similar to that used in [40] but comprises only ten convolutional layers and one fully-connected layer. The input to this CNN is a greyscale image of size 100 × 100 pixels, aligned using a simple 2D affine transformation. More details about this CNN architecture can be found in [4].

As a first training stage, our method adopts the approach of training a classifier wherein the CNN produces a vector of scores s with one entry s_j per class j, which is passed to a softmax function to calculate the probability p of the correct class y:

p = \frac{e^{s_y}}{\sum_j e^{s_j}}    (1)

The total loss of the CNN is defined as the average cross-entropy loss over a batch of training samples:

L = -\frac{1}{N} \sum_{i=1}^{N} \log p_i    (2)

where N is the number of samples in the batch and p_i is the probability assigned to the correct class of the i-th training sample.

In order to use the trained CNN classification model to compare face images that are not present in the training set, the classification layer (i.e. the layer producing the scores s) is discarded and the features from the previous layer are used as bottleneck features. These bottleneck features can be used directly as the feature vector representing a face, or they can be further optimised as described in Section 3.2. We have adopted cosine similarity to compare pairs of feature vectors and obtain a similarity score that indicates the likelihood of two face images belonging to the same identity.

We trained such a CNN classification model using the CASIA-WebFace database [4]. This database contains 494,414 face images of 10,575 different celebrities gathered from the Internet. We randomly selected 10% of the images as validation images and used the rest as training images. We consider this CNN model as the baseline for performance comparison in our work and refer to it henceforth as model A.
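As an illustration of the two stages just described, the sketch below computes the average cross-entropy loss of Equations (1)–(2) from raw class scores and compares two bottleneck feature vectors with cosine similarity. It is a minimal NumPy sketch under our own naming and data layout; it is not the authors' implementation.

```python
import numpy as np

def softmax_cross_entropy(scores, labels):
    """Average cross-entropy loss of Eqs. (1)-(2).

    scores: (N, C) array of raw class scores s produced by the CNN.
    labels: (N,) array with the index of the correct class y of each sample.
    """
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_p = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(labels)), labels].mean()

def cosine_similarity(f1, f2):
    """Similarity score between two bottleneck feature vectors."""
    return float(np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2)))
```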
3.1. Occlusion Maps

As shown in [6], it is possible to use visualisation techniques to gain insight into the behaviour of CNN models after they have been trained. To address the facial occlusion challenge, we are interested in identifying the face regions that a CNN model relies on the most, as we want to avoid this reliance. Using a classification model, one way of visualising these regions is by observing how the correct class score fluctuates when different face regions are occluded. A similar type of occlusion sensitivity experiment was conducted in [6] in the context of object recognition. In our case, by occluding a face image for which a CNN model predicts the correct class, we can generate a binary occlusion map O^I that indicates whether placing an occluder at a particular spatial location in the image I would cause the model to predict an incorrect class. More formally, a binary occlusion map O^I is defined as follows:

O^{I}_{i,j} = \begin{cases} 0, & \text{if } \hat{y}_{i,j} = y \\ 1, & \text{otherwise} \end{cases}    (3)

where \hat{y}_{i,j} is the predicted class when the centre of an occluder is placed at the location (i, j) of the image I, and y is the correct class for the image I.

Since we are using face images that are aligned, we can construct a generic occlusion map O by simply averaging the binary occlusion maps of a set of face images. Each value O_{i,j} of an occlusion map corresponds to the classification error incurred by a model when an occluder is placed at the location (i, j) in all the images used to generate O. For convenience, we refer to face regions that present a high classification error as high effect regions (as these are the regions on which the model relies the most). By contrast, we refer to face regions that present a low classification error as low effect regions. These high and low effect regions correspond to the bright and dark areas, respectively, in the occlusion maps shown in Figure 1.

Figure 1: (a), (d), (g) Mean image occluded at a random location with an occluder of 20×20, 20×40, and 40×40 pixels respectively. (b), (e), (h) Occlusion maps O20×20, O20×40, and O40×40 generated using model A and the corresponding occluders. The pixel intensity of the occlusion maps represents the classification error rate when placing the occluder at each location. (c), (f), (i) Masked mean image using the occlusion maps O20×20, O20×40, and O40×40 respectively.

Considering the 100 × 100 face images used as input to our model, we experiment with occluders of three different sizes. In particular, we use (i) a square occluder of 20 × 20 pixels that can cover small regions such as one eye, the nose or the mouth, as shown in Figure 1a; (ii) a rectangular occluder of 20 × 40 pixels that can cover wider regions such as both eyes simultaneously, as shown in Figure 1d; and (iii) a larger square occluder of 40 × 40 pixels that can cover several face regions simultaneously, as shown in Figure 1g. We denote the occlusion maps generated with the 20 × 20, 20 × 40 and 40 × 40 occluders by O20×20, O20×40 and O40×40 respectively. Figures 1b, 1e and 1h show examples of these occlusion maps generated with model A using 1,000 images from our validation set.

According to Figures 1c, 1f and 1i, the central part of the face is one of the highest effect regions. This might be due to the presence of non-frontal face images in the training set. Since the central part of the face is typically visible in both frontal and non-frontal face images, the model learns more discriminative features from this area compared to the outer parts of the face, which might not be visible in non-frontal face images. Simply put, the model is trained with fewer face images in which the outer parts of the face are visible and, therefore, it relies more heavily on the central part of the face. We can reverse this behaviour by training with more face images that present occlusions located in high effect regions (the central part of the face), as this will force the model to learn more discriminative features from low effect regions (the outer parts of the face).
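A generic occlusion map O can be obtained by sliding an occluder over a set of correctly classified validation images and averaging the resulting binary maps of Equation (3). The sketch below is illustrative only: `predict_class` is a hypothetical helper wrapping the trained classifier (model A), and the stride and the constant-intensity probe occluder are simplifications of ours, not details taken from the paper.

```python
import numpy as np

def binary_occlusion_map(image, label, predict_class, occ_h=20, occ_w=20, stride=4):
    """Binary occlusion map O^I of Eq. (3) for one aligned greyscale face image."""
    h, w = image.shape
    occ_map = np.zeros((h, w), dtype=np.float32)
    for i in range(0, h, stride):
        for j in range(0, w, stride):
            occluded = image.copy()
            top, left = max(0, i - occ_h // 2), max(0, j - occ_w // 2)
            occluded[top:top + occ_h, left:left + occ_w] = 128  # constant probe occluder
            # 1 where the occluder makes the model predict the wrong class
            occ_map[i:i + stride, j:j + stride] = float(predict_class(occluded) != label)
    return occ_map

def generic_occlusion_map(images, labels, predict_class, **kwargs):
    """Average the binary maps of a set of images to obtain the generic map O."""
    maps = [binary_occlusion_map(img, y, predict_class, **kwargs)
            for img, y in zip(images, labels)]
    return np.mean(maps, axis=0)  # each entry is a classification error rate
```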
One way of achieving this is by augmenting the training set with face images that present occlusions at random locations. To do this, during training we can generate occluded training images by overlaying the original training images with a randomly located occluder. However, since we want to favour occlusions in high effect regions, we propose to augment the training set with face images that present occlusions located in high effect regions more frequently than in low effect regions. For this reason, the location of the occluder is sampled from a probability distribution P generated by applying the softmax function with a temperature parameter T to an occlusion map O:

P_{i,j} = \frac{e^{O_{i,j}/T}}{\sum_{n,m} e^{O_{n,m}/T}}    (4)

With high temperatures, all locations have the same probability. With low temperatures, locations in high effect regions are assigned a higher probability. A short code sketch of this sampling procedure is given at the end of this subsection.

Figure 2: Example occluders used during training with different intensities, noise types and noise levels. (a) Salt-and-pepper noise. (b) Speckle noise. (c) Gaussian noise.

As shown in Figure 2, we use occluders of random intensities (or random colours if we were dealing with colour images) that present different types of random noise (salt-and-pepper, speckle and Gaussian noise). This is important because if the face were always covered by the same type of occluder, the CNN would only learn features that are robust against that particular type of occlusion. For example, if a black patch were always used to occlude faces during training, the CNN model would perform well when the face is occluded by a black patch, but not when it is occluded by a patch of a different intensity.

This training procedure produces two desired outcomes, namely, (i) the training set is augmented with variations not present in the original data, and (ii) the occluder has a regularising effect, helping the CNN to learn features from all face regions equally. Both of these increase the generalisation capability of the model and prevent overfitting. In Section 4 we provide experimental results, with both occluded and non-occluded face images, to validate these claims.
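The augmentation described in this subsection can be sketched as follows. This is an illustrative NumPy version under our own assumptions: only a Gaussian-noise occluder is shown (salt-and-pepper and speckle noise would be generated analogously), and clipping the occluder inside the image borders is our choice rather than a detail reported in the paper.

```python
import numpy as np

def location_distribution(occ_map, temperature):
    """Probability distribution P over occluder locations (Eq. 4)."""
    logits = occ_map / temperature
    logits = logits - logits.max()  # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def sample_occluded_image(image, occ_map, temperature, occ_h, occ_w, rng=np.random):
    """Overlay a noisy occluder at a location sampled from P (training-time augmentation)."""
    p = location_distribution(occ_map, temperature).ravel()
    idx = rng.choice(p.size, p=p)  # sample the occluder centre
    ci, cj = np.unravel_index(idx, occ_map.shape)
    top = int(np.clip(ci - occ_h // 2, 0, image.shape[0] - occ_h))
    left = int(np.clip(cj - occ_w // 2, 0, image.shape[1] - occ_w))
    # random base intensity plus Gaussian noise (one of the noise types in Figure 2)
    occluder = rng.uniform(0, 255) + rng.normal(0, 25, size=(occ_h, occ_w))
    out = image.copy()
    out[top:top + occ_h, left:left + occ_w] = np.clip(occluder, 0, 255)
    return out
```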
3.2. Batch Triplet Loss

In order to make the bottleneck features generalise better to classes not present in the training set, we fine-tune model A using a triplet loss function. This training objective is also used in other similar works [5, 21, 13, 14]. However, in this work, we fine-tune the bottleneck features directly instead of learning a linear projection from them. It could be argued that the CNN model could be trained from scratch using a triplet loss function, as proposed in [11]. But, according to our experiments, training with the softmax cross-entropy loss offers faster convergence than training with a triplet loss when a reasonable number of samples per class is available and the number of classes is not very large.

Figure 3: Example triplet from the CASIA-WebFace dataset. (a) Before triplet training. (b) After triplet training.

To form a triplet we need an anchor image, a positive image and a negative image. The anchor and the positive images belong to the same class and the negative image belongs to a different class. Denoting the output vector of the CNN model as z (in our setting this would be the bottleneck features), we can represent the output features for a particular triplet i as (z^a_i, z^p_i, z^n_i), denoting the output features of the anchor, positive and negative images respectively.

The goal of a triplet loss function is to make the distance between z^a_i and z^n_i (i.e. images from different classes) larger than the distance between z^a_i and z^p_i (i.e. images from the same class) by at least a minimum margin α. Figure 3 shows a visual representation of a triplet before and after training. In this work, we consider the following as the standard triplet loss function:

L = \sum_{i=1}^{N} \max\left(0, \|z^a_i - z^p_i\|_2^2 - \|z^a_i - z^n_i\|_2^2 + \alpha\right)    (5)

Alternative versions of the standard triplet loss function can be defined with distance metrics other than the squared Euclidean distance. For example, the dot product is used as the similarity measure in [13]. More generally, we can write

L = \sum_{i=1}^{N} \max\left(0, d(z^a_i, z^p_i) - d(z^a_i, z^n_i) + \alpha\right)    (6)

where d(x, y) is any function that gives a score indicating the distance between two feature vectors. As seen in Equation 6, only triplets that violate the margin condition d(z^a_i, z^p_i) + α > d(z^a_i, z^n_i) produce a loss greater than zero and therefore contribute to the model's convergence. To increase the training efficiency, we adopt the online triplet sampling strategy proposed in [11] to select such triplets and only use them during training. Taking this into consideration, we can rewrite Equation 6 as

L = \mu_{ap} - \mu_{an} + \alpha    (7)

where μ_ap and μ_an are the mean values of the distributions of positive and negative scores respectively. From Equation 7 we can see that the loss becomes zero whenever μ_an is equal to μ_ap plus the margin α. In other words, the triplet loss function tries to separate the mean value of the distribution of positive scores μ_ap from the mean value of the distribution of negative scores μ_an by a minimum margin α.

A problem with the standard triplet loss function is that, in general, separating the mean values of the two score distributions does not ensure that the model performs well in a verification task. In Figure 4 we show how a CNN model that has been fine-tuned with the standard triplet loss function is able to further separate the mean values of the two score distributions but does not produce a better accuracy. This is because there might be more overlap between the two distributions, causing more false positives and/or false negatives.

Figure 4: (a) Distribution of positive and negative scores after training a CNN classification model. (b) Distribution of positive and negative scores after fine-tuning the same CNN model with the standard triplet loss. Observe how, even though the triplet training has been able to further separate the mean values of the two distributions, there is more overlap between them, causing more false positives and/or false negatives.

A solution to this problem is to also minimise the standard deviation of each score distribution. Our loss function is inspired by the concept of decidability, proposed in [41] as a way of measuring the achievable accuracy of a verification system regardless of the selected threshold or operating point. A possible measure of decidability is defined as follows [41]:

d = \frac{|\mu_{ap} - \mu_{an}|}{\sqrt{\frac{1}{2}\left(\sigma^2_{ap} + \sigma^2_{an}\right)}}    (8)

where σ²_ap and σ²_an are the variances of the distributions of positive and negative scores respectively.

Equation 8 implies that a higher decidability d is achieved by increasing the difference between the mean values of the two score distributions while decreasing both of their variances.
Although it would be possible to use the inverse of Equation 8 as our training objective, in practice, using the margin parameter α leads to a better separation between the two score distributions. For this reason, we construct our loss function by adding a new term to Equation 7 that accounts for the variance of the two score distributions:

L = (1 - \beta)\left(\mu_{ap} - \mu_{an} + \alpha\right) + \beta\left(\sigma^2_{ap} + \sigma^2_{an}\right)    (9)

where β is a parameter that balances the contribution of the two terms. In particular, at β = 1, the term that accounts for the difference between the mean values of the score distributions vanishes and only the term that accounts for their variances has an effect. The opposite happens when β = 0.

An advantage of adding this new term to the triplet loss function is that, even if a triplet does not violate the margin condition, the loss will usually be greater than zero, since the term that accounts for the variances of the score distributions is non-zero. Even though this means that adopting an online triplet sampling strategy is not strictly needed, in our experiments we noticed faster convergence when using it. Concurrent with our work, a similar loss function has been proposed in [26] to learn local image descriptors. However, [26] does not make use of online triplet sampling.

Note that the loss function in Equation 9 cannot be expressed as the average loss for each training image, since the variances need to be computed from more than one sample. Ideally, we need to train using large enough batches of images so that the variance estimation is more accurate. For this reason, we refer to this form of triplet loss as batch triplet loss. In Section 4.2, we show the improved accuracy obtained when using our loss function compared to the standard triplet loss function.
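To make the two training objectives concrete, the sketch below computes the standard triplet loss of Equation (5) and the batch triplet loss of Equation (9) over a batch of triplets. It is an illustrative PyTorch version under our own assumptions: positive and negative scores are taken to be squared Euclidean distances between L2-normalised bottleneck features, the standard loss is averaged over the batch, and all names are ours rather than the authors'.

```python
import torch

def triplet_losses(anchor, positive, negative, alpha=0.5, beta=0.7):
    """Standard triplet loss (Eq. 5) and batch triplet loss (Eq. 9).

    anchor, positive, negative: (N, D) tensors of bottleneck features z^a, z^p, z^n.
    alpha and beta default to the values reported in Section 4.2.
    """
    d_ap = (anchor - positive).pow(2).sum(dim=1)  # positive scores (same identity)
    d_an = (anchor - negative).pow(2).sum(dim=1)  # negative scores (different identities)

    standard = torch.clamp(d_ap - d_an + alpha, min=0).mean()  # Eq. (5), batch mean

    mu_ap, mu_an = d_ap.mean(), d_an.mean()
    var_ap, var_an = d_ap.var(unbiased=False), d_an.var(unbiased=False)
    batch = (1 - beta) * (mu_ap - mu_an + alpha) + beta * (var_ap + var_an)  # Eq. (9)
    return standard, batch
```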
4. Experiments

In this section, we provide experimental results for our two contributions. In Section 4.1 we test different CNN models trained with occluded training images as described in Section 3.1. We use the CASIA-WebFace database [4] to evaluate the performance on faces that present artificial occlusions and the AR face database [7] to evaluate the performance on face images that present real-life occlusions. In Section 4.2 we show our experimental results on the LFW [1] benchmark using the CNN models evaluated in Section 4.1 and their fine-tuned versions using the standard triplet loss function and the proposed batch triplet loss function.

4.1. Performance on Occluded Faces

In Section 3.1 we described a training procedure for increasing the classification accuracy of a CNN model on occluded faces by using a probability distribution P to augment the training set with occluded training images. In this section we start by comparing the performance of two different training schemes. The first training scheme comprises fine-tuning our baseline model A, described in Section 3, with occluded training images generated by sampling the occluder locations from a probability distribution P. The probability distribution P is obtained by applying Equation 4 to an occlusion map O of a particular size. Each occlusion map O was generated with model A using a subset of 1,000 images from the CASIA-WebFace validation set, as described in Section 3.1. By contrast, the second training scheme comprises fine-tuning model A with occluded training images generated by sampling the occluder locations from a standard normal distribution. The goal of training with these two training schemes is to assess the benefits of training CNN models with images overlaid by strategically located occluders as opposed to randomly located occluders.

We train several CNN models in this manner, one for each of the occluder sizes shown in Figure 1. The temperature value T in Equation 4 was empirically set to 0.25, 0.4 and 0.6 for O20×20, O20×40 and O40×40 respectively. We add the size of the occluder used to generate the occluded training images to the name of each fine-tuned model. Additionally, if the model was trained following the second training scheme, an R is added to the model name. For example, model A fine-tuned with occluded training images overlaid by an occluder of 20 × 20 pixels becomes A20×20 if the locations of the occluder are sampled from P (first training scheme) and A20×20R if the locations of the occluder are sampled from a standard normal distribution (second training scheme).

To compare the accuracy of these fine-tuned CNN models we generate occlusion maps O with them (one for each of the occluder sizes). Since an occlusion map indicates the classification error incurred by a model at each spatial location, we can easily calculate the mean classification accuracy as one minus the mean value of the occlusion map, i.e. $1 - \frac{1}{|O|}\sum_{i,j} O_{i,j}$. Table 1 shows the mean classification accuracy and standard deviation for each occlusion map generated with each fine-tuned model. For each model in Table 1, we generated the three occlusion maps O20×20, O20×40 and O40×40 using a subset of 1,000 images from the CASIA-WebFace validation set, chosen in such a way that all the selected images are correctly classified by the model if no occluder is used. For example, to generate O20×20, O20×40 and O40×40 with model A20×20, we used a subset of 1,000 images that were classified correctly by model A20×20. To avoid any bias in the results, we selected a different subset of 1,000 images to generate the occlusion maps used to compute the results shown in Table 1 and to generate the probability distribution P used when training each model. In other words, we avoided testing our models on the same images that were (indirectly) incorporated into the training stage through the use of P.

Table 1: Mean classification accuracy and standard deviation of each occlusion map O generated by different CNN models.

Model     | O20×20          | O20×40          | O40×40
A         | 92.9%  ± 10.99  | 86.18% ± 18.51  | 76.19% ± 27.89
A20×20    | 97.69% ± 2.62   | 95.1%  ± 5.64   | 88.9%  ± 14.13
A20×20R   | 97.12% ± 3.55   | 93.98% ± 7.39   | 86.93% ± 16.98
A20×40    | 97.75% ± 2.42   | 95.85% ± 4.03   | 90.64% ± 10.88
A20×40R   | 97.62% ± 2.9    | 95.45% ± 5.12   | 89.54% ± 13.29
A40×40    | 98.37% ± 1.7    | 96.8%  ± 3.16   | 93.47% ± 6.94
A40×40R   | 98.31% ± 2.29   | 96.52% ± 4.14   | 92.61% ± 9.13

Observe that not only do all the models fine-tuned with occluded training images achieve a higher classification accuracy than model A, but their standard deviations are also considerably smaller. This indicates that the performance of the fine-tuned models is much less affected by the location of the occluder, i.e. the models are able to extract discriminative features from all the face regions more equally, regardless of the location of the occluder. Moreover, the results in Table 1 show the better performance of the models trained with occluded training images overlaid by strategically located occluders compared to those trained with occluded training images overlaid by randomly located occluders.
Since the goal is to improve the accuracy when dealing with real-life occlusions, we have further evaluated the performance of our CNN models on the AR face database [7]. The AR face database contains 4,000 face images of 126 different subjects with different facial expressions, illumination conditions and occlusions. Of these, we only use faces with different illumination conditions and occlusions. The different illumination conditions correspond to face images with a light on the left side, the right side or both. The occluded face images consist of people wearing either sunglasses or a scarf. We carry out three different evaluations. In each evaluation, we compare non-occluded faces against (i) other non-occluded faces, (ii) faces occluded by a pair of sunglasses, and (iii) faces occluded by a scarf. Figure 5 shows examples of each type of image used in the three evaluations.

Figure 5: Example images from the AR database. In each subfigure, the highlighted image on the top left is the reference image (target image) used to compare against the other three images (query images). (a) Non-occluded. (b) Wearing sunglasses. (c) Wearing scarf.

As shown by the resulting ROC curves in Figures 6a to 6c, the models trained with occluded training images consistently outperform the baseline model A, particularly at low False Acceptance Rates. Note that the performance does not seem to be greatly affected by the occluder size. The ROC curve for the evaluation of non-occluded faces (Figure 6a) shows that model A40×40 performs slightly worse than models A20×20 and A20×40, perhaps because the large occluder used during training makes the model rely on fewer features. As a consequence, the model performs worse when presented with non-occluded faces in which all the face regions are visible and contain useful features. In contrast, model A20×20 performs worse than the other two when presented with faces occluded by a pair of sunglasses (Figure 6b) or a scarf (Figure 6c). This might be because the occluder used during training is too small to simulate these types of occlusions.

Observe that, even though model A40×40 achieved the best classification accuracy when evaluated on face images that present artificial occlusions (Table 1), the results on the AR face database differ because the evaluation involves comparing pairs of occluded and non-occluded face images instead of only classifying occluded face images. For this reason, the models need to perform well not only when presented with occluded face images but also with non-occluded face images. It seems that using a medium-sized occluder like the one used to train model A20×40 offers the best performance when taking into account the three different evaluations, as it avoids the problems encountered with small occluders (not being able to simulate large occlusions like sunglasses and scarves) and with large occluders (worse performance when presented with non-occluded faces). Note that we did not repeat these experiments with the models trained using the second training scheme described earlier, as their performance was already shown to be inferior.

4.2. Performance on the LFW benchmark

We now adopt another approach to training by fine-tuning model A using the standard triplet loss and the batch triplet loss described in Section 3.2.
We do this by discarding the classification layer, normalising the features of the previous layer (the bottleneck features) using the L2-norm, and fine-tuning all the CNN layers with one of the two loss functions. We refer to the CNN model fine-tuned with the standard triplet loss function as model B, and to the CNN model fine-tuned with the batch triplet loss function as model C. The parameter α is set to 0.5 when using either of these training objectives, and the parameter β is set to 0.7 when training with the batch triplet loss function. The values for both α and β were obtained empirically.

Additionally, we also trained these CNN models with occluded training images. Similarly to the notation followed in Section 4.1, we append the size of the occluder used during training to the model name. In this case, we trained these models by fine-tuning a model that had already been trained with occluded training images instead of fine-tuning model A. The locations of the occluders were sampled from the same probability distributions P that were used in Section 4.1. For example, to train model B20×40 we fine-tuned model A20×40 (and not model A) with occluded training images overlaid by an occluder of 20 × 40 pixels placed at locations sampled from the same probability distribution P that was used to train A20×40.

Our models are evaluated on the LFW dataset following the unrestricted, labelled outside data protocol [42] (i.e. the protocol that allows training with data that is not part of the LFW dataset). The LFW protocol divides the test set into ten splits. The classification accuracy on each test split is calculated by counting the number of matching and mismatching pairs given a certain threshold (in our case, pairs that give a similarity score above the threshold are counted as matching pairs and pairs that give a similarity score below the threshold are counted as mismatching pairs). For each test split, we selected the threshold that gives the highest number of correct classifications on the other nine splits. The final reported value is the mean classification accuracy and the standard deviation calculated over the ten test splits (a code sketch of this protocol is given after Table 2). Note that most of the face images in the LFW dataset are not occluded; therefore, we do not expect to see a performance improvement as large as that seen in our experiments with occluded faces in Section 4.1. Figure 7 shows examples of matching and mismatching pairs of face images from the LFW benchmark.

As shown in Table 2, all the CNN models fine-tuned with the batch triplet loss outperform the CNN models trained with the standard triplet loss, validating the usefulness of our approach. Moreover, consistent with the results shown in Figure 6, the CNN models trained with the 20 × 40 occluder are the best performers.

Model   | Accuracy
A       | 97.33% ± 0.71
B       | 97.73% ± 0.76
C       | 98.12% ± 0.65
A20×20  | 97.4%  ± 0.71
B20×20  | 97.85% ± 0.69
C20×20  | 98.35% ± 0.73
A20×40  | 97.68% ± 0.83
B20×40  | 97.79% ± 0.82
C20×40  | 98.42% ± 0.68
A40×40  | 97.18% ± 0.63
B40×40  | 97.5%  ± 0.57
C40×40  | 98.16% ± 0.64

Table 2: Mean classification accuracy and standard deviation of different CNN models evaluated following the LFW unrestricted, labelled outside data protocol.
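The per-split evaluation protocol described above can be summarised in a short sketch. This is an illustrative NumPy version under our own assumptions about the data layout (arrays of pair similarity scores, ground-truth match labels and split indices); it is not the authors' evaluation code.

```python
import numpy as np

def lfw_accuracy(scores, labels, fold_ids):
    """10-fold LFW evaluation: choose the threshold on nine splits, test on the held-out one.

    scores:   (M,) cosine similarities for all evaluation pairs.
    labels:   (M,) 1 for matching pairs, 0 for mismatching pairs.
    fold_ids: (M,) split index in {0, ..., 9} for each pair.
    """
    labels = labels.astype(bool)
    accuracies = []
    for fold in range(10):
        train, test = fold_ids != fold, fold_ids == fold
        candidates = np.unique(scores[train])
        # threshold giving the highest number of correct classifications on the nine splits
        best_t = max(candidates,
                     key=lambda t: np.mean((scores[train] >= t) == labels[train]))
        accuracies.append(np.mean((scores[test] >= best_t) == labels[test]))
    return float(np.mean(accuracies)), float(np.std(accuracies))
```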
Figure 6: AR face database ROC curves. (a) Non-occluded. (b) Wearing sunglasses. (c) Wearing scarf.

Figure 7: Example pairs from the LFW benchmark. (a) Matching pairs. (b) Mismatching pairs.

5. Conclusions

We have investigated which parts of the human face have the highest impact on face recognition accuracy. The proposed occlusion maps are a good way of visualising these regions and, at the same time, provide useful information about a classification model's performance on faces that present artificial occlusions. According to our experimental results, even a state-of-the-art CNN-based face recognition model fails to maintain its high performance when these face regions are occluded (e.g. by a pair of sunglasses or a scarf). We have demonstrated how these occlusion maps can be used during the training procedure to augment the training set with face images that present artificial occlusions. These artificial occlusions are strategically positioned in locations where the performance of a CNN model trained in a conventional way is most sensitive. By training with these augmented training sets, we produce CNN models that are more robust to facial occlusions. As shown in our experimental results, our proposed method offers consistent performance improvement on face images that present artificial or real-life occlusions, as well as on face images that do not present any occlusions.

Additionally, we have revisited the problem of learning features for a verification task using distance metric objectives. We have extended the widely used triplet loss function by adding a new term that minimises the standard deviation of the distributions of positive and negative scores. In our experiments on the LFW benchmark, the proposed batch triplet loss has consistently achieved better results than the standard triplet loss. Finally, experimental results have confirmed that the best CNN models result from a combination of our two proposed approaches, regardless of whether the face images are occluded or not.

Acknowledgments

This work resulted from a collaborative research project between the University of Hertfordshire and IDscan Biometrics Ltd (now part of GBG plc) as part of a Knowledge Transfer Partnership (KTP) programme supported by Innovate UK (partnership number: 009547).

References

[1] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, "Labeled faces in the wild: A database for studying face recognition in unconstrained environments," Technical Report 07-49, University of Massachusetts, Amherst, 2007.

[2] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard, "The megaface benchmark: 1 million faces for recognition at scale," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4873–4882, 2016.

[3] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, "Ms-celeb-1m: A dataset and benchmark for large-scale face recognition," in European Conference on Computer Vision, pp. 87–102, Springer, 2016.
[4] D. Yi, Z. Lei, S. Liao, and S. Z. Li, "Learning face representation from scratch," arXiv preprint arXiv:1411.7923, 2014.

[5] O. M. Parkhi, A. Vedaldi, and A. Zisserman, "Deep face recognition," in British Machine Vision Conference, vol. 1, p. 6, 2015.

[6] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in Computer Vision–ECCV 2014, pp. 818–833, Springer, 2014.

[7] A. M. Martinez, "The AR face database," CVC Technical Report, vol. 24, 1998.

[8] Y. Sun, X. Wang, and X. Tang, "Deep learning face representation from predicting 10,000 classes," in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pp. 1891–1898, IEEE, 2014.

[9] S. Chopra, R. Hadsell, and Y. LeCun, "Learning a similarity metric discriminatively, with application to face verification," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1, pp. 539–546, IEEE, 2005.

[10] H. Fan, Z. Cao, Y. Jiang, Q. Yin, and C. Doudou, "Learning deep face representation," arXiv preprint arXiv:1403.2802, 2014.

[11] F. Schroff, D. Kalenichenko, and J. Philbin, "Facenet: A unified embedding for face recognition and clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, 2015.

[12] Y. Sun, Y. Chen, X. Wang, and X. Tang, "Deep learning face representation by joint identification-verification," in Advances in Neural Information Processing Systems, pp. 1988–1996, 2014.

[13] S. Sankaranarayanan, A. Alavi, and R. Chellappa, "Triplet similarity embedding for face verification," arXiv preprint arXiv:1602.03418, 2016.

[14] S. Sankaranarayanan, A. Alavi, C. Castillo, and R. Chellappa, "Triplet probabilistic embedding for face verification and clustering," arXiv preprint arXiv:1604.05417, 2016.

[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[16] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, "Face recognition: A convolutional neural-network approach," Neural Networks, IEEE Transactions on, vol. 8, no. 1, pp. 98–113, 1997.

[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

[18] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "Deepface: Closing the gap to human-level performance in face verification," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1701–1708, IEEE, 2014.

[19] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun, "Bayesian face revisited: A joint formulation," Computer Vision–ECCV 2012, pp. 566–579, 2012.

[20] E. Zhou, Z. Cao, and Q. Yin, "Naive-deep face recognition: Touching the limit of LFW benchmark or not?," arXiv preprint arXiv:1501.04690, 2015.

[21] J. Liu, Y. Deng, T. Bai, Z. Wei, and C. Huang, "Targeting ultimate accuracy: Face recognition via deep embedding," arXiv preprint arXiv:1506.07310, 2015.

[22] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu, "Learning fine-grained image similarity with deep ranking," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1386–1393, 2014.

[23] E. Hoffer and N. Ailon, "Deep metric learning using triplet network," in International Workshop on Similarity-Based Pattern Recognition, pp. 84–92, Springer, 2015.
[24] A. Gordo, J. Almazán, J. Revaud, and D. Larlus, "Deep image retrieval: Learning global representations for image search," in European Conference on Computer Vision, pp. 241–257, Springer, 2016.

[25] P. Wohlhart and V. Lepetit, "Learning descriptors for object recognition and 3d pose estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3109–3118, 2015.

[26] B. Kumar, G. Carneiro, I. Reid, et al., "Learning local image descriptors with deep siamese and triplet convolutional networks by minimising global loss functions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5385–5394, 2016.

[27] C. Ding and D. Tao, "Robust face recognition via multimodal deep face representation," IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 2049–2058, 2015.

[28] R. Min, A. Hadid, and J. L. Dugelay, "Improving the recognition of faces occluded by facial accessories," in 2011 IEEE International Conference on Automatic Face and Gesture Recognition and Workshops (FG 2011), pp. 442–447, 2011.

[29] M. Sharma, S. Prakash, and P. Gupta, "An efficient partial occluded face recognition system," Neurocomputing, vol. 116, pp. 231–241, 2013.

[30] S. Park, H. Lee, J.-H. Yoo, G. Kim, and S. Kim, "Partially occluded facial image retrieval based on a similarity measurement," Mathematical Problems in Engineering, vol. 2015, p. e217568, 2015.

[31] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, 2009.

[32] Z. Zhou, A. Wagner, H. Mobahi, J. Wright, and Y. Ma, "Face recognition with contiguous occlusion using markov random fields," in ICCV, pp. 1050–1057, 2009.

[33] H. Jia and A. M. Martinez, "Face recognition with occlusions in the training and testing sets," in Automatic Face & Gesture Recognition, 2008. FG'08. 8th IEEE International Conference on, pp. 1–6, IEEE, 2008.

[34] M. Yang, L. Zhang, S. C. K. Shiu, and D. Zhang, "Gabor feature based robust representation and classification for face recognition with gabor occlusion dictionary," Pattern Recognition, vol. 46, no. 7, pp. 1865–1878, 2013.

[35] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, no. Dec, pp. 3371–3408, 2010.

[36] J. Xie, L. Xu, and E. Chen, "Image denoising and inpainting with deep neural networks," in Advances in Neural Information Processing Systems 25 (F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, eds.), pp. 341–349, Curran Associates, Inc., 2012.

[37] L. Cheng, J. Wang, Y. Gong, and Q. Hou, "Robust deep auto-encoder for occluded face recognition," in Proceedings of the 23rd ACM International Conference on Multimedia, MM '15, (New York, NY, USA), pp. 1099–1102, ACM, 2015.

[38] Y. Zhang, R. Liu, S. Zhang, and M. Zhu, "Occlusion-robust face recognition using iterative stacked denoising autoencoder," in International Conference on Neural Information Processing, pp. 352–359, Springer, 2013.

[39] F. Zhao, J. Feng, J. Zhao, W. Yang, and S. Yan, "Robust lstm-autoencoders for face de-occlusion in the wild," arXiv preprint arXiv:1612.08534, 2016.

[40] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[41] J. Daugman, "Biometric decision landscapes," Technical Report No. 482, University of Cambridge, Computer Laboratory, 2000.

[42] G. B. Huang and E. Learned-Miller, "Labeled faces in the wild: Updates and new reporting procedures," Technical Report 14-003, Dept. Comput. Sci., Univ. Massachusetts Amherst, Amherst, MA, USA, 2014.