Abstract:
Current CNN-based solutions to salient object detection (SOD) mainly rely on the optimization of the cross-entropy loss (CELoss), while the quality of the detected saliency maps is usually evaluated in terms of the F-measure. In this paper, we investigate an interesting issue: can we consistently use the F-measure formulation in both training and evaluation for SOD? By reformulating the standard F-measure, we propose the relaxed F-measure, which is differentiable w.r.t. the posterior and can easily be appended to a CNN as the loss function. Compared with the conventional cross-entropy loss, whose gradients decrease dramatically in the saturated area, our loss function, named FLoss, holds considerable gradients even when the activation approaches the target. Consequently, FLoss continuously forces the network to produce polarized activations. Comprehensive benchmarks on several popular datasets show that FLoss outperforms the state of the art by a considerable margin. More specifically, owing to the polarized predictions, our method is able to obtain high-quality saliency maps without carefully tuning the binarization threshold, showing significant advantages in real-world applications.
Formulation of Proposed Loss Function:
In the standard F-measure, the numbers of true positives, false positives and false negatives are obtained by counting the corresponding samples:
$$ \begin{equation} \begin{split} TP(\dot{Y}^t, Y) &= \sum\nolimits_i 1(y_i==1 \ \text{and} \ \dot{y}^t_i==1) \\ FP(\dot{Y}^t, Y) &= \sum\nolimits_i 1(y_i==0 \ \text{and} \ \dot{y}^t_i==1) \\ FN(\dot{Y}^t, Y) &= \sum\nolimits_i 1(y_i==1 \ \text{and} \ \dot{y}^t_i==0), \end{split} \label{eq:tpfp0} \end{equation} $$where $\dot{Y}^t$ is the prediction binarized with threshold $t$ and $Y$ is the ground-truth saliency map. $1(\cdot)$ is an indicator function that evaluates to $1$ if its argument is true and $0$ otherwise. To incorporate the F-measure into a CNN and optimize it in an end-to-end manner, we have to define a decomposable F-measure that is differentiable w.r.t. the posterior $\hat{Y}$. Based on this motivation, we reformulate the true positives, false positives and false negatives based on the continuous posterior $\hat{Y}$:
$$ \begin{equation} \begin{split} TP(\hat{Y}, Y) &= \sum\nolimits_i \hat{y}_i \cdot y_i, \\ FP(\hat{Y}, Y) &= \sum\nolimits_i \hat{y}_i \cdot (1 - y_i), \\ FN(\hat{Y}, Y) &= \sum\nolimits_i (1-\hat{y}_i) \cdot y_i. \end{split} \label{eq:tpfp} \end{equation} $$Given the definitions in Eq.\ref{eq:tpfp}, precision $p$ and recall $r$ are:
$$ \begin{equation} \begin{split} p(\hat{Y}, Y) &= \frac{TP}{TP + FP} \\ r(\hat{Y}, Y) &= \frac{TP}{TP + FN}. \end{split} \label{pr} \end{equation} $$Finally our relaxed F-measure can be written as:
$$ \begin{equation} \begin{split} F(\hat{Y}, Y) &= \frac{(1+\beta^2) p \cdot r}{\beta^2 p + r} \\ &= \frac{(1 + \beta^2)TP}{\beta^2(TP + FN) + (TP + FP)} \\ &= \frac{(1 + \beta^2)TP}{H}, \end{split} \label{f} \end{equation} $$where $H\! =\! \beta^2(TP + FN) + (TP + FP)$. Due to the relaxation in Eq.\ref{eq:tpfp}, Eq.\ref{f} is decomposable w.r.t. the posterior $\hat{Y}$ and can therefore be integrated into a CNN architecture trained with back-propagation.
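For concreteness, the relaxed quantities above can be computed directly from a continuous prediction map. Below is a minimal PyTorch sketch, not the authors' code: the tensors `pred` and `gt`, the value of $\beta^2$, and the small `eps` guard are illustrative (a batched implementation by the authors is given in the Code section below):

```python
import torch

pred = torch.rand(64, 64)                 # continuous posterior Y_hat in [0, 1]
gt = (torch.rand(64, 64) > 0.5).float()   # binary ground-truth saliency map Y
beta2 = 0.3                               # beta^2, the value commonly used in SOD evaluation
eps = 1e-10                               # numerical guard, not part of the formulation

# relaxed counts
TP = (pred * gt).sum()
FP = (pred * (1 - gt)).sum()
FN = ((1 - pred) * gt).sum()

# precision, recall and the relaxed F-measure
p = TP / (TP + FP + eps)
r = TP / (TP + FN + eps)
F = (1 + beta2) * p * r / (beta2 * p + r + eps)

# equivalent direct form with H = beta^2*(TP + FN) + (TP + FP)
H = beta2 * (TP + FN) + (TP + FP)
assert torch.allclose(F, (1 + beta2) * TP / (H + eps), atol=1e-5)
```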
In order to maximize the relaxed F-measure in CNNs in an end-to-end manner, we define our proposed F-measure based loss (FLoss) function $\mathcal{L}_{F}$ as:
$$ \begin{equation} \mathcal{L}_{F}(\hat{Y}, Y) = 1 - F = 1 - \frac{(1 + \beta^2)TP}{H}\label{eq:floss}. \end{equation} $$Minimizing $\mathcal{L}_{F}(\hat{Y}, Y)$ is equivalent to maximizing the relaxed F-measure. The partial derivative of loss $\mathcal{L}_{F}$ over network activation $\hat{Y}$ at location $i$ is:
$$ \begin{equation} \begin{split} \frac{\partial \mathcal{L}_{F}}{\partial \hat{y}_i} &= -\frac{\partial F}{\partial \hat{y}_i} \\ &= -\Big(\frac{\partial F}{\partial TP}\cdot \frac{\partial TP}{\partial \hat{y}_i} + \frac{\partial F}{\partial H }\cdot \frac{\partial H }{\partial \hat{y}_i}\Big) \\ &= -\Big(\frac{(1+\beta^2)y_i}{H} - \frac{(1+\beta^2)TP}{H^2}\Big) \\ &= \frac{(1+\beta^2)TP}{H^2} - \frac{(1+\beta^2)y_i}{H} .\\ \end{split}\label{eq:grad-floss} \end{equation} $$
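As a sanity check, this closed-form gradient can be compared against PyTorch's automatic differentiation. The following is a minimal sketch under the same definitions (tensor sizes and names are illustrative):

```python
import torch

beta2 = 0.3
pred = torch.rand(100, requires_grad=True)   # posterior Y_hat (flattened)
gt = (torch.rand(100) > 0.5).float()         # ground truth Y

TP = (pred * gt).sum()
H = beta2 * gt.sum() + pred.sum()            # beta^2*(TP + FN) + (TP + FP)
loss = 1 - (1 + beta2) * TP / H              # FLoss
loss.backward()

# closed-form gradient: (1+beta^2)*TP/H^2 - (1+beta^2)*y_i/H
analytic = (1 + beta2) * TP.detach() / H.detach() ** 2 - (1 + beta2) * gt / H.detach()
print(torch.allclose(pred.grad, analytic, atol=1e-6))  # should print True up to float precision
```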
Experimental Results:
Visual results:
Quantitative Comparisons:
Compromise Between Precision and Recall:
Stability Against Thresholds:
Citation:
If our method is helpful to your research, please consider citing:
@InProceedings{zhao2019optimizing,
author = {Kai Zhao and Shanghua Gao and Wenguan Wang and Ming-Ming Cheng},
title = {Optimizing the {F}-measure for Threshold-free Salient Object Detection},
booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
month = {Oct},
year = {2019},
url = {http://kaizhao.net/fmeasure},
}
Code and Pretrained Model:
Simple PyTorch implementation:
- In functional interface:

```python
import torch

def floss(prediction, target, beta=0.3, log_like=False):
    # `beta` here plays the role of beta^2 in the relaxed F-measure; 0.3 is the common choice for SOD.
    EPS = 1e-10                       # guards against division by zero
    N = prediction.size(0)
    TP = (prediction * target).view(N, -1).sum(dim=1)
    # H = beta^2*(TP + FN) + (TP + FP) = beta^2*sum(Y) + sum(Y_hat), computed per image
    H = beta * target.view(N, -1).sum(dim=1) + prediction.view(N, -1).sum(dim=1)
    fmeasure = (1 + beta) * TP / (H + EPS)
    if log_like:
        floss = -torch.log(fmeasure)  # log-likelihood style variant
    else:
        floss = (1 - fmeasure)
    return floss
```
- In Module interface:

```python
import torch
from torch import nn

class FLoss(nn.Module):
    def __init__(self, beta=0.3, log_like=False):
        super(FLoss, self).__init__()
        self.beta = beta              # plays the role of beta^2 in the relaxed F-measure
        self.log_like = log_like

    def forward(self, prediction, target):
        EPS = 1e-10                   # guards against division by zero
        N = prediction.size(0)
        TP = (prediction * target).view(N, -1).sum(dim=1)
        H = self.beta * target.view(N, -1).sum(dim=1) + prediction.view(N, -1).sum(dim=1)
        fmeasure = (1 + self.beta) * TP / (H + EPS)
        if self.log_like:
            floss = -torch.log(fmeasure)
        else:
            floss = (1 - fmeasure)
        return floss
```
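Either interface can be used as a drop-in criterion. A minimal usage sketch, with random tensors standing in for the network output and ground truth (shapes are illustrative):

```python
import torch

prediction = torch.rand(4, 1, 224, 224, requires_grad=True)   # stand-in for sigmoid outputs in [0, 1]
target = (torch.rand(4, 1, 224, 224) > 0.5).float()           # binary ground-truth masks

# functional interface
loss = floss(prediction, target).mean()   # average the per-image loss over the batch
loss.backward()

# Module interface
criterion = FLoss(beta=0.3)
loss = criterion(prediction, target).mean()
loss.backward()
```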
Full training code:
- Full training code (based on Caffe) is available at https://github.com/zeakey/iccv2019-fmeasure .
- A model pretrained on the MSRAB dataset can be downloaded from: data.kaizhao.net/projects/fmeasure-saliency/floss-msrab-pretrained.zip .
- Detection and evaluation results on the ECSSD dataset can be found here: data.kaizhao.net/projects/fmeasure-saliency/floss-ecssd-results.zip .