Optimizing the F-measure for Threshold-free Salient Object Detection
1 College of computer science, Nankai University, Tianjin, China.
2 Inception Institude of Artificial Intelligence (IIAI), Abu Dhabi, UAE.
Kai Zhao is looking for a postdoctoral position in Deep Learning and Computer Vision.
Current CNN-based solutions to salient object detection (SOD) mainly rely on the optimization of cross-entropy loss (CELoss). Then the quality of detected saliency maps is often evaluated in terms of F-measure. In this paper, we investigate an interesting issue: can we consistently use the F-measure formulation in both training and evaluation for SOD? By reformulating the standard F-measure we propose the relaxed F-measure which is differentiable w.r.t the posterior and can be easily appended to the back of CNNs as the loss function. Compared to the conventional cross-entropy loss of which the gradients decrease dramatically in the saturated area, our loss function, named FLoss, holds considerable gradients even when the activation approaches the target. Consequently, the FLoss can continuously force the network to produce polarized activations. Comprehensive benchmarks on several popular datasets show that FLoss outperforms the state-of-the-art with a considerable margin. More specifically, due to the polarized predictions, our method is able to obtain high-quality saliency maps without carefully tuning the optimal threshold, showing significant advantages in real-world applications.
In the standard F-measure, the true positive, false positive and false negative are defined as the number of corresponding samples:
$$ \begin{equation} \begin{split} TP(\dot{Y}^t, Y) &= \sum\nolimits_i 1(y_i==1 \ \text{and} \ \dot{y}^t_i==1) \\ FP(\dot{Y}^t, Y) &= \sum\nolimits_i 1(y_i==0 \ \text{and} \ \dot{y}^t_i==1) \\ FN(\dot{Y}^t, Y) &= \sum\nolimits_i 1(y_i==1 \ \text{and} \ \dot{y}^t_i==0), \end{split} \label{eq:tpfp0} \end{equation} $$where $Y$ is the ground-truth, $\dot{Y}^t$ is the binary prediction binarized by threshold $t$ and $Y$ is the ground-truth saliency map. $1(\cdot)$ is an indicator function that evaluates to $1$ if its argument is true and 0 otherwise. To incorporate the F-measure into CNN and optimize it in an end-to-end manner, we have to define a decomposable F-measure that is differentiable over posterior $\hat{Y}$. Based on this motivation, we reformulate the true positive, false positive and false negative based on the continuous posterior $\hat{Y}$:
$$ \begin{equation} \begin{split} TP(\hat{Y}, Y) &= \sum\nolimits_i \hat{y}_i \cdot y_i, \\ FP(\hat{Y}, Y) &= \sum\nolimits_i \hat{y}_i \cdot (1 - y_i), \\ FN(\hat{Y}, Y) &= \sum\nolimits_i (1-\hat{y}_i) \cdot y_i. \end{split} \label{eq:tpfp} \end{equation} $$Given the definitions in Eq.\ref{eq:tpfp}, precision $p$ and recall $r$ are:
$$ \begin{equation} \begin{split} p(\hat{Y}, Y) &= \frac{TP}{TP + FP} \\ r(\hat{Y}, Y) &= \frac{TP}{TP + FN}. \end{split} \label{pr} \end{equation} $$Finally our relaxed F-measure can be written as:
$$ \begin{equation} \begin{split} F(\hat{Y}, Y) &= \frac{(1+\beta^2) p \cdot r}{\beta^2 p + r} \\ &= \frac{(1 + \beta^2)TP}{\beta^2(TP + FN) + (TP + FP)} \\ &= \frac{(1 + \beta^2)TP}{H}, \end{split} \label{f} \end{equation} $$where $H\! =\! \beta^2(TP + FN) + (TP + FP)$. Due to the relaxation in Eq.\ref{eq:tpfp}, Eq.\ref{f} is decomposable w.r.t the posterior $\hat{Y}$, therefore can be integrated in CNN architecture trained with back-prop.
In order to maximize the relaxed F-measure in CNNs in an end-to-end manner, we define our proposed F-measure based loss (FLoss) function $\mathcal{L}_{F}$ as:
$$ \begin{equation} \mathcal{L}_{F}(\hat{Y}, Y) = 1 - F = 1 - \frac{(1 + \beta^2)TP}{H}\label{eq:floss}. \end{equation} $$Minimizing $\mathcal{L}_{F}(\hat{Y}, Y)$ is equivalent to maximizing the relaxed F-measure. The partial derivative of loss $\mathcal{L}_{F}$ over network activation $\hat{Y}$ at location $i$ is:
$$ \begin{equation} \begin{split} \frac{\partial \mathcal{L}_{F}}{\partial \hat{y}_i} &= -\frac{\partial F}{\partial \hat{y}_i} \\ &= -\Big(\frac{\partial F}{\partial TP}\cdot \frac{\partial TP}{\partial \hat{y}_i} + \frac{\partial F}{\partial H }\cdot \frac{\partial H }{\partial \hat{y}_i}\Big) \\ &= -\Big(\frac{(1+\beta^2)y_i}{H} - \frac{(1+\beta^2)TP}{H^2}\Big) \\ &= \frac{(1+\beta^2)TP}{H^2} - \frac{(1+\beta^2)y_i}{H} .\\ \end{split}\label{eq:grad-floss} \end{equation} $$@InProceedings{zhao2019optimizing,
author = {Kai Zhao and Shanghua Gao and Wenguan Wang and Ming-ming Cheng},
title = {Optimizing the {F}-measure for Threshold-free Salient Object Detection},
booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
month = {Oct},
year = {2019},
url = {http://kaizhao.net/fmeasure},
}
def floss(prediction, target, beta=0.3, log_like=False):
EPS = 1e-10
N = N = prediction.size(0)
TP = (prediction * target).view(N, -1).sum(dim=1)
H = beta * target.view(N, -1).sum(dim=1) + prediction.view(N, -1).sum(dim=1)
fmeasure = (1 + beta) * TP / (H + EPS)
if log_like:
floss = -torch.log(fmeasure)
else:
floss = (1 - fmeasure)
return floss
from torch import nn
class FLoss(nn.Module):
def __init__(self, beta=0.3, log_like=False):
super(FLoss, self).__init__()
self.beta = beta
self.log_like = log_like
def forward(self, prediction, target):
EPS = 1e-10
N = prediction.size(0)
TP = (prediction * target).view(N, -1).sum(dim=1)
H = self.beta * target.view(N, -1).sum(dim=1) + prediction.view(N, -1).sum(dim=1)
fmeasure = (1 + self.beta) * TP / (H + EPS)
if self.log_like:
floss = -torch.log(fmeasure)
else:
floss = (1 - fmeasure)
return floss
(The comment system is provided by Disqus that is blocked by the GFW. For users from mainland China you may need a VPN.)