Abstract:
Current CNN-based solutions to salient object detection (SOD) mainly rely on the optimization of the cross-entropy loss (CELoss). The quality of the detected saliency maps, however, is usually evaluated in terms of the F-measure. In this paper, we investigate an interesting issue: can we consistently use the F-measure formulation in both training and evaluation for SOD? By reformulating the standard F-measure, we propose the relaxed F-measure, which is differentiable w.r.t. the posterior and can be easily appended to the back of CNNs as the loss function. Compared to the conventional cross-entropy loss, whose gradients decrease dramatically in the saturated area, our loss function, named FLoss, holds considerable gradients even when the activation approaches the target. Consequently, FLoss can continuously force the network to produce polarized activations. Comprehensive benchmarks on several popular datasets show that FLoss outperforms the state-of-the-art by a considerable margin. More specifically, due to the polarized predictions, our method is able to obtain high-quality saliency maps without carefully tuning the optimal threshold, showing significant advantages in real-world applications.
Formulation of Proposed Loss Function:
In the standard F-measure, the true positive, false positive and false negative are defined as the number of corresponding samples:
$$ \begin{equation} \begin{split} TP(\dot{Y}^t, Y) &= \sum\nolimits_i 1(y_i==1 \ \text{and} \ \dot{y}^t_i==1) \\ FP(\dot{Y}^t, Y) &= \sum\nolimits_i 1(y_i==0 \ \text{and} \ \dot{y}^t_i==1) \\ FN(\dot{Y}^t, Y) &= \sum\nolimits_i 1(y_i==1 \ \text{and} \ \dot{y}^t_i==0), \end{split} \label{eq:tpfp0} \end{equation} $$where $\dot{Y}^t$ is the prediction binarized by threshold $t$, $Y$ is the ground-truth saliency map, and $1(\cdot)$ is an indicator function that evaluates to $1$ if its argument is true and $0$ otherwise. To incorporate the F-measure into a CNN and optimize it in an end-to-end manner, we have to define a decomposable F-measure that is differentiable over the posterior $\hat{Y}$. Based on this motivation, we reformulate the true positive, false positive and false negative based on the continuous posterior $\hat{Y}$:
$$ \begin{equation} \begin{split} TP(\hat{Y}, Y) &= \sum\nolimits_i \hat{y}_i \cdot y_i, \\ FP(\hat{Y}, Y) &= \sum\nolimits_i \hat{y}_i \cdot (1 - y_i), \\ FN(\hat{Y}, Y) &= \sum\nolimits_i (1 - \hat{y}_i) \cdot y_i. \end{split} \label{eq:tpfp} \end{equation} $$Given the definitions in Eq.\ref{eq:tpfp}, precision $p$ and recall $r$ are:
$$ \begin{equation} \begin{split} p(\hat{Y}, Y) &= \frac{TP}{TP + FP} \\ r(\hat{Y}, Y) &= \frac{TP}{TP + FN}. \end{split} \label{pr} \end{equation} $$Finally, our relaxed F-measure can be written as:
$$ \begin{equation} \begin{split} F(\hat{Y}, Y) &= \frac{(1+\beta^2) p \cdot r}{\beta^2 p + r} \\ &= \frac{(1 + \beta^2)TP}{\beta^2(TP + FN) + (TP + FP)} \\ &= \frac{(1 + \beta^2)TP}{H}, \end{split} \label{f} \end{equation} $$where $H\! =\! \beta^2(TP + FN) + (TP + FP)$. Due to the relaxation in Eq.\ref{eq:tpfp}, Eq.\ref{f} is decomposable w.r.t. the posterior $\hat{Y}$ and can therefore be integrated into a CNN architecture trained with back-propagation.
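The relaxed quantities above can be checked numerically on a toy input. The following is a minimal pure-Python sketch, not the authors' implementation: the function names `relaxed_counts` and `relaxed_fmeasure`, the parameter name `beta2` (standing for $\beta^2$) and the four-pixel example are illustrative assumptions. It computes the soft TP, FP and FN, then evaluates the F-measure both from precision and recall and from the closed form $(1+\beta^2)TP/H$.

```python
def relaxed_counts(y_hat, y):
    """Soft TP, FP, FN from posteriors y_hat in [0, 1] and binary labels y."""
    tp = sum(p * t for p, t in zip(y_hat, y))
    fp = sum(p * (1 - t) for p, t in zip(y_hat, y))
    fn = sum((1 - p) * t for p, t in zip(y_hat, y))
    return tp, fp, fn


def relaxed_fmeasure(y_hat, y, beta2=0.3):
    """Relaxed F-measure via the closed form (1 + beta^2) * TP / H."""
    tp, fp, fn = relaxed_counts(y_hat, y)
    h = beta2 * (tp + fn) + (tp + fp)
    return (1 + beta2) * tp / h


# Toy example (hypothetical values): four pixels, two of them salient.
y_hat = [0.9, 0.2, 0.7, 0.1]
y = [1, 0, 1, 0]
beta2 = 0.3  # assumed value for beta^2, chosen for illustration

tp, fp, fn = relaxed_counts(y_hat, y)
p = tp / (tp + fp)  # relaxed precision
r = tp / (tp + fn)  # relaxed recall
f_from_pr = (1 + beta2) * p * r / (beta2 * p + r)
f_closed = relaxed_fmeasure(y_hat, y, beta2)
print(f_from_pr, f_closed)
```

Both expressions give the same value up to floating-point rounding, which is the equality between the first and last lines of Eq.\ref{f}.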
In order to maximize the relaxed F-measure in CNNs in an end-to-end manner, we define our proposed F-measure-based loss (FLoss) function $\mathcal{L}_{F}$ as:
$$ \begin{equation} \mathcal{L}_{F}(\hat{Y}, Y) = 1 - F = 1 - \frac{(1 + \beta^2)TP}{H}\label{eq:floss}. \end{equation} $$Minimizing $\mathcal{L}_{F}(\hat{Y}, Y)$ is equivalent to maximizing the relaxed F-measure. The partial derivative of loss $\mathcal{L}_{F}$ over network activation $\hat{Y}$ at location $i$ is:
$$ \begin{equation} \begin{split} \frac{\partial \mathcal{L}_{F}}{\partial \hat{y}_i} &= -\frac{\partial F}{\partial \hat{y}_i} \\ &= -\Big(\frac{\partial F}{\partial TP}\cdot \frac{\partial TP}{\partial \hat{y}_i} + \frac{\partial F}{\partial H }\cdot \frac{\partial H }{\partial \hat{y}_i}\Big) \\ &= -\Big(\frac{(1+\beta^2)y_i}{H} - \frac{(1+\beta^2)TP}{H^2}\Big) \\ &= \frac{(1+\beta^2)TP}{H^2} - \frac{(1+\beta^2)y_i}{H} .\\ \end{split}\label{eq:gradfloss} \end{equation} $$
Experimental Results:
Visual results:
Quantitative Comparisons:
Compromise Between Precision and Recall:
Stability Against Thresholds:
Citation:
If our method is helpful to your research, please consider citing:
@InProceedings{zhao2019optimizing,
author = {Kai Zhao and Shanghua Gao and Wenguan Wang and Ming-Ming Cheng},
title = {Optimizing the {F}-measure for Threshold-free Salient Object Detection},
booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
month = {Oct},
year = {2019},
url = {http://kaizhao.net/fmeasure},
}
Code and Pretrained Model:
Simple PyTorch implementation:

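Functional interface:
import torch

def floss(prediction, target, beta=0.3, log_like=False):
    # `beta` corresponds to beta^2 in the formulas above.
    EPS = 1e-10  # avoids division by zero when H is 0
    N = prediction.size(0)
    # Relaxed true positives, summed per sample.
    TP = (prediction * target).view(N, -1).sum(dim=1)
    # H = beta^2 * (TP + FN) + (TP + FP) = beta^2 * sum(target) + sum(prediction)
    H = beta * target.view(N, -1).sum(dim=1) + prediction.view(N, -1).sum(dim=1)
    fmeasure = (1 + beta) * TP / (H + EPS)
    if log_like:
        floss = -torch.log(fmeasure)
    else:
        floss = 1 - fmeasure
    # One loss value per sample in the batch.
    return floss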

Module interface:
import torch
from torch import nn

class FLoss(nn.Module):
    def __init__(self, beta=0.3, log_like=False):
        super().__init__()
        # `beta` corresponds to beta^2 in the formulas above.
        self.beta = beta
        self.log_like = log_like

    def forward(self, prediction, target):
        EPS = 1e-10  # avoids division by zero when H is 0
        N = prediction.size(0)
        # Relaxed true positives, summed per sample.
        TP = (prediction * target).view(N, -1).sum(dim=1)
        # H = beta^2 * sum(target) + sum(prediction)
        H = self.beta * target.view(N, -1).sum(dim=1) + prediction.view(N, -1).sum(dim=1)
        fmeasure = (1 + self.beta) * TP / (H + EPS)
        if self.log_like:
            floss = -torch.log(fmeasure)
        else:
            floss = 1 - fmeasure
        # One loss value per sample in the batch.
        return floss
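As a sanity check on the gradient derived in the formulation section, the closed-form derivative can be compared against central finite differences. This is a minimal pure-Python sketch under stated assumptions, not part of the released code: the helper names `floss_value`, `floss_grad` and `numeric_grad`, the parameter name `beta2` (standing for $\beta^2$) and the toy values are all hypothetical. It uses the identity $H = \beta^2 \sum_i y_i + \sum_i \hat{y}_i$, which follows from the definitions of TP, FP and FN.

```python
def floss_value(y_hat, y, beta2=0.3):
    """FLoss = 1 - (1 + beta^2) * TP / H for one flattened saliency map."""
    tp = sum(p * t for p, t in zip(y_hat, y))
    h = beta2 * sum(y) + sum(y_hat)
    return 1 - (1 + beta2) * tp / h


def floss_grad(y_hat, y, beta2=0.3):
    """Closed-form gradient: (1 + beta^2) * TP / H^2 - (1 + beta^2) * y_i / H."""
    tp = sum(p * t for p, t in zip(y_hat, y))
    h = beta2 * sum(y) + sum(y_hat)
    return [(1 + beta2) * tp / h ** 2 - (1 + beta2) * t / h for t in y]


def numeric_grad(y_hat, y, beta2=0.3, step=1e-6):
    """Central finite differences of floss_value w.r.t. each y_hat[i]."""
    grads = []
    for i in range(len(y_hat)):
        up = list(y_hat)
        up[i] += step
        down = list(y_hat)
        down[i] -= step
        diff = floss_value(up, y, beta2) - floss_value(down, y, beta2)
        grads.append(diff / (2 * step))
    return grads


# Toy example (hypothetical values): four pixels, two of them salient.
y_hat = [0.9, 0.2, 0.7, 0.1]
y = [1, 0, 1, 0]
analytic = floss_grad(y_hat, y)
numeric = numeric_grad(y_hat, y)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(analytic, max_diff)
```

In this example the gradient is negative at salient pixels and positive at background pixels, so a gradient-descent step raises the posterior where $y_i = 1$ and lowers it where $y_i = 0$.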
Full training code:
 Full training code (based on Caffe) is available at https://github.com/zeakey/iccv2019fmeasure .
 A model pretrained on the MSRA-B dataset can be downloaded from: data.kaizhao.net/projects/fmeasuresaliency/flossmsrabpretrained.zip .
 Detection and evaluation results on the ECSSD dataset can be found here: data.kaizhao.net/projects/fmeasuresaliency/flossecssdresults.zip .