Optimizing the F-measure for Threshold-free Salient Object Detection




Current CNN-based solutions to salient object detection (SOD) mainly rely on the optimization of cross-entropy loss (CELoss), while the quality of detected saliency maps is often evaluated in terms of F-measure. In this paper, we investigate an interesting issue: can we consistently use the F-measure formulation in both training and evaluation for SOD? By reformulating the standard F-measure, we propose the relaxed F-measure, which is differentiable w.r.t. the posterior and can be easily appended to the back of CNNs as the loss function. Compared to the conventional cross-entropy loss, whose gradients decrease dramatically in the saturated area, our loss function, named FLoss, holds considerable gradients even when the activation approaches the target. Consequently, the FLoss can continuously force the network to produce polarized activations. Comprehensive benchmarks on several popular datasets show that FLoss outperforms the state-of-the-art by a considerable margin. More specifically, due to the polarized predictions, our method is able to obtain high-quality saliency maps without carefully tuning the optimal threshold, showing significant advantages in real-world applications.

Formulation of Proposed Loss Function:

In the standard F-measure, the true positives, false positives and false negatives are defined as counts of the corresponding samples:

$$ \begin{equation} \begin{split} TP(\dot{Y}^t, Y) &= \sum\nolimits_i 1(y_i==1 \ \text{and} \ \dot{y}^t_i==1) \\ FP(\dot{Y}^t, Y) &= \sum\nolimits_i 1(y_i==0 \ \text{and} \ \dot{y}^t_i==1) \\ FN(\dot{Y}^t, Y) &= \sum\nolimits_i 1(y_i==1 \ \text{and} \ \dot{y}^t_i==0), \end{split} \label{eq:tpfp0} \end{equation} $$

where $Y$ is the ground-truth saliency map and $\dot{Y}^t$ is the binary prediction obtained by thresholding at $t$. $1(\cdot)$ is an indicator function that evaluates to $1$ if its argument is true and $0$ otherwise. To incorporate the F-measure into a CNN and optimize it in an end-to-end manner, we have to define a decomposable F-measure that is differentiable with respect to the posterior $\hat{Y}$. To this end, we reformulate the true positives, false positives and false negatives in terms of the continuous posterior $\hat{Y}$:

$$ \begin{equation} \begin{split} TP(\hat{Y}, Y) &= \sum\nolimits_i \hat{y}_i \cdot y_i, \\ FP(\hat{Y}, Y) &= \sum\nolimits_i \hat{y}_i \cdot (1 - y_i), \\ FN(\hat{Y}, Y) &= \sum\nolimits_i (1-\hat{y}_i) \cdot y_i. \end{split} \label{eq:tpfp} \end{equation} $$
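The difference between the thresholded counts and the relaxed ones can be illustrated with a small, stdlib-only Python sketch. The function names, the toy values and the `>=` binarization rule are assumptions made for this illustration; the text above does not specify whether a score equal to $t$ counts as positive.

```python
def hard_counts(y_hat, y, t=0.5):
    """Standard TP, FP, FN after binarizing y_hat at threshold t.

    Assumption: a score equal to t is treated as positive (>=).
    """
    y_bin = [1 if p >= t else 0 for p in y_hat]
    tp = sum(1 for b, g in zip(y_bin, y) if g == 1 and b == 1)
    fp = sum(1 for b, g in zip(y_bin, y) if g == 0 and b == 1)
    fn = sum(1 for b, g in zip(y_bin, y) if g == 1 and b == 0)
    return tp, fp, fn


def relaxed_counts(y_hat, y):
    """Relaxed TP, FP, FN computed directly on the continuous posterior."""
    tp = sum(p * g for p, g in zip(y_hat, y))
    fp = sum(p * (1 - g) for p, g in zip(y_hat, y))
    fn = sum((1 - p) * g for p, g in zip(y_hat, y))
    return tp, fp, fn


# Toy example (hypothetical values): four pixels, two of them salient.
y = [1, 1, 0, 0]
y_hat = [0.9, 0.4, 0.2, 0.7]

print("hard   :", hard_counts(y_hat, y))
print("relaxed:", relaxed_counts(y_hat, y))
```

When the posterior is already binary, the relaxed counts coincide with the standard ones, so the relaxation only changes the behavior for intermediate scores.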

Given the definitions in Eq.\ref{eq:tpfp}, precision $p$ and recall $r$ are:

$$ \begin{equation} \begin{split} p(\hat{Y}, Y) &= \frac{TP}{TP + FP} \\ r(\hat{Y}, Y) &= \frac{TP}{TP + FN}. \end{split} \label{pr} \end{equation} $$

Finally, our relaxed F-measure can be written as:

$$ \begin{equation} \begin{split} F(\hat{Y}, Y) &= \frac{(1+\beta^2) p \cdot r}{\beta^2 p + r} \\ &= \frac{(1 + \beta^2)TP}{\beta^2(TP + FN) + (TP + FP)} \\ &= \frac{(1 + \beta^2)TP}{H}, \end{split} \label{f} \end{equation} $$

where $H\! =\! \beta^2(TP + FN) + (TP + FP)$. Due to the relaxation in Eq.\ref{eq:tpfp}, Eq.\ref{f} is decomposable w.r.t. the posterior $\hat{Y}$ and can therefore be integrated into a CNN architecture trained with back-propagation.
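From the relaxed definitions, $TP + FN = \sum_i y_i$ and $TP + FP = \sum_i \hat{y}_i$, so $H = \beta^2 \sum_i y_i + \sum_i \hat{y}_i$. The following Python sketch checks on toy values that the precision/recall form and the $TP/H$ form agree. The function names and the default `beta2=0.3` are assumptions for illustration; this section does not state the value of $\beta^2$.

```python
def relaxed_f_from_pr(y_hat, y, beta2=0.3):
    """Relaxed F-measure via precision and recall (first line of the equation).

    Note: this form is undefined when TP is zero (0/0 in the final ratio).
    """
    tp = sum(p * g for p, g in zip(y_hat, y))
    fp = sum(p * (1 - g) for p, g in zip(y_hat, y))
    fn = sum((1 - p) * g for p, g in zip(y_hat, y))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)


def relaxed_f(y_hat, y, beta2=0.3):
    """Relaxed F-measure via H = beta2 * sum(y) + sum(y_hat) (last line)."""
    tp = sum(p * g for p, g in zip(y_hat, y))
    h = beta2 * sum(y) + sum(y_hat)
    return (1 + beta2) * tp / h


# Toy example (hypothetical values).
y = [1, 1, 0, 0]
y_hat = [0.9, 0.4, 0.2, 0.7]

print("via p and r:", relaxed_f_from_pr(y_hat, y))
print("via TP / H :", relaxed_f(y_hat, y))
```

The $TP/H$ form needs only three sums over the map and stays defined when $TP = 0$, as long as $H > 0$.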

In order to maximize the relaxed F-measure in CNNs in an end-to-end manner, we define our proposed F-measure-based loss (FLoss) function $\mathcal{L}_{F}$ as:

$$ \begin{equation} \mathcal{L}_{F}(\hat{Y}, Y) = 1 - F = 1 - \frac{(1 + \beta^2)TP}{H}\label{eq:floss}. \end{equation} $$

Minimizing $\mathcal{L}_{F}(\hat{Y}, Y)$ is equivalent to maximizing the relaxed F-measure. The partial derivative of loss $\mathcal{L}_{F}$ over network activation $\hat{Y}$ at location $i$ is:

$$ \begin{equation} \begin{split} \frac{\partial \mathcal{L}_{F}}{\partial \hat{y}_i} &= -\frac{\partial F}{\partial \hat{y}_i} \\ &= -\Big(\frac{\partial F}{\partial TP}\cdot \frac{\partial TP}{\partial \hat{y}_i} + \frac{\partial F}{\partial H }\cdot \frac{\partial H }{\partial \hat{y}_i}\Big) \\ &= -\Big(\frac{(1+\beta^2)y_i}{H} - \frac{(1+\beta^2)TP}{H^2}\Big) \\ &= \frac{(1+\beta^2)TP}{H^2} - \frac{(1+\beta^2)y_i}{H} .\\ \end{split}\label{eq:grad-floss} \end{equation} $$
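The closed-form gradient can be sanity-checked against central finite differences. The sketch below is a minimal stdlib-only Python check, not the authors' implementation; the function names, the toy values, the step size and the default `beta2=0.3` are assumptions for illustration.

```python
def floss(y_hat, y, beta2=0.3):
    """FLoss = 1 - relaxed F-measure, with H = beta2 * sum(y) + sum(y_hat)."""
    tp = sum(p * g for p, g in zip(y_hat, y))
    h = beta2 * sum(y) + sum(y_hat)
    return 1 - (1 + beta2) * tp / h


def floss_grad(y_hat, y, beta2=0.3):
    """Analytic gradient: (1+beta2)*TP/H^2 - (1+beta2)*y_i/H at each location i."""
    tp = sum(p * g for p, g in zip(y_hat, y))
    h = beta2 * sum(y) + sum(y_hat)
    return [(1 + beta2) * tp / h ** 2 - (1 + beta2) * g / h for g in y]


def numeric_grad(y_hat, y, beta2=0.3, step=1e-6):
    """Central finite-difference approximation of the same gradient."""
    grads = []
    for i in range(len(y_hat)):
        up = list(y_hat)
        down = list(y_hat)
        up[i] += step
        down[i] -= step
        grads.append((floss(up, y, beta2) - floss(down, y, beta2)) / (2 * step))
    return grads


# Toy example (hypothetical values).
y = [1, 1, 0, 0]
y_hat = [0.9, 0.4, 0.2, 0.7]

analytic = floss_grad(y_hat, y)
numeric = numeric_grad(y_hat, y)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print("analytic:", analytic)
print("numeric :", numeric)
print("max abs difference:", max_diff)
```

Because $TP < H$ whenever $\beta^2 \sum_i y_i + FP > 0$, the gradient is negative at salient locations ($y_i = 1$) and positive at background locations ($y_i = 0$), so a gradient-descent step raises scores on salient pixels and lowers them on background pixels.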
Figure: Surface plots of different loss functions in a 2-point, 2-class classification setting. Left: FLoss, Mid: Log-FLoss, Right: Cross-entropy loss. In the top row the ground-truth is [0, 1] and in the bottom row the ground-truth is [1, 1]. Compared with cross-entropy loss (and Log-FLoss), FLoss holds considerable gradient even in the saturated area, leading to polarized predictions.

Experimental Results:

Visual results:

Figure: Salient object detection examples on several popular datasets. F-DHS, F-Amulet and F-DSS denote the original architectures trained with our proposed FLoss. FLoss leads to sharp saliency confidence, especially on object boundaries.

Quantitative Comparisons:

Figure: Quantitative comparisons with competitor methods.

Compromise Between Precision and Recall:

Figure: Precision, Recall, F-measure and Maximal F-measure (·) of DSS (- - -) and F-DSS (---) under different thresholds. DSS tends to predict unknown pixels as the majority class (the background), resulting in high precision but low recall. FLoss is able to find a better compromise between precision and recall.

Stability Against Thresholds:

Figure: FLoss (solid lines) achieves a high F-measure over a larger range of thresholds, showing stability against changes of the threshold.
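Why polarized predictions make the threshold less critical can be seen in a toy threshold sweep. The Python sketch below uses made-up scores, not results from the paper: one prediction vector is close to 0/1 and the other is spread over the unit interval. The function names, the `>=` binarization rule, the zero-denominator fallback and `beta2=0.3` are assumptions for illustration.

```python
def f_measure_at(y_hat, y, t, beta2=0.3):
    """Standard F-measure after binarizing y_hat at threshold t (>= is assumed)."""
    y_bin = [1 if p >= t else 0 for p in y_hat]
    tp = sum(1 for b, g in zip(y_bin, y) if b == 1 and g == 1)
    fp = sum(1 for b, g in zip(y_bin, y) if b == 1 and g == 0)
    fn = sum(1 for b, g in zip(y_bin, y) if b == 0 and g == 1)
    denom = beta2 * (tp + fn) + (tp + fp)
    # Assumption: return 0.0 when there are no positives at all.
    return (1 + beta2) * tp / denom if denom > 0 else 0.0


def sweep(y_hat, y, thresholds, beta2=0.3):
    """F-measure at each threshold in the given list."""
    return [f_measure_at(y_hat, y, t, beta2) for t in thresholds]


# Toy data (hypothetical): three salient pixels, five background pixels.
y = [1, 1, 1, 0, 0, 0, 0, 0]
polarized = [0.97, 0.95, 0.92, 0.04, 0.03, 0.05, 0.02, 0.06]
spread = [0.80, 0.55, 0.35, 0.45, 0.30, 0.20, 0.10, 0.60]
thresholds = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9

for name, scores in (("polarized", polarized), ("spread", spread)):
    values = sweep(scores, y, thresholds)
    print(name, [round(v, 3) for v in values])
```

In this toy case the polarized scores give the same binarization for every threshold in the sweep, while the spread scores change their F-measure as the threshold moves.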


If our method is helpful to your research, please consider citing:
  author    = {Kai Zhao and Shanghua Gao and Wenguan Wang and Ming-ming Cheng},
  title     = {Optimizing the {F}-measure for Threshold-free Salient Object Detection},
  booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
  month     = {Oct},
  year      = {2019},
  url       = {http://kaizhao.net/fmeasure},

Code and Pretrained Model:

Simple PyTorch implementation:

Full training code: