Optimizing the F-measure for Threshold-free Salient Object Detection
1 College of computer science, Nankai University, Tianjin, China.
2 Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, UAE.
Current CNN-based solutions to salient object detection (SOD) mainly rely on the optimization of the cross-entropy loss (CELoss), while the quality of the detected saliency maps is usually evaluated in terms of the F-measure. In this paper, we investigate an interesting issue: can we consistently use the F-measure formulation in both training and evaluation for SOD? By reformulating the standard F-measure, we propose the relaxed F-measure, which is differentiable w.r.t the posterior and can be easily appended to the back of CNNs as the loss function. Compared to the conventional cross-entropy loss, whose gradients decrease dramatically in the saturated area, our loss function, named FLoss, holds considerable gradients even when the activation approaches the target. Consequently, FLoss can continuously force the network to produce polarized activations. Comprehensive benchmarks on several popular datasets show that FLoss outperforms the state-of-the-art by a considerable margin. More specifically, due to the polarized predictions, our method is able to obtain high-quality saliency maps without carefully tuning the optimal threshold, showing significant advantages in real-world applications.
In the standard F-measure, the true positive, false positive and false negative are defined as the number of corresponding samples:
$$ \begin{equation} \begin{split} TP(\dot{Y}^t, Y) &= \sum\nolimits_i 1(y_i==1 \ \text{and} \ \dot{y}^t_i==1) \\ FP(\dot{Y}^t, Y) &= \sum\nolimits_i 1(y_i==0 \ \text{and} \ \dot{y}^t_i==1) \\ FN(\dot{Y}^t, Y) &= \sum\nolimits_i 1(y_i==1 \ \text{and} \ \dot{y}^t_i==0), \end{split} \label{eq:tpfp0} \end{equation} $$where $Y$ is the ground-truth saliency map and $\dot{Y}^t$ is the prediction binarized at threshold $t$. $1(\cdot)$ is an indicator function that evaluates to $1$ if its argument is true and $0$ otherwise. To incorporate the F-measure into a CNN and optimize it in an end-to-end manner, we have to define a decomposable F-measure that is differentiable w.r.t the posterior $\hat{Y}$. Based on this motivation, we reformulate the true positive, false positive and false negative based on the continuous posterior $\hat{Y}$:
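As a quick sanity check of the thresholded definitions in Eq.\ref{eq:tpfp0}, the counts can be computed directly from a binarized prediction. The arrays below are made-up illustrative values, not data from the paper:

```python
# Toy illustration of Eq. 1: count TP/FP/FN after binarizing at threshold t.
# Ground truth and posterior below are made-up example values.
y = [1, 1, 0, 0]               # ground-truth labels y_i
y_hat = [0.8, 0.4, 0.6, 0.2]   # continuous posterior y^_i
t = 0.5
y_dot = [1 if p > t else 0 for p in y_hat]  # binarized prediction -> [1, 0, 1, 0]

TP = sum(1 for yi, di in zip(y, y_dot) if yi == 1 and di == 1)
FP = sum(1 for yi, di in zip(y, y_dot) if yi == 0 and di == 1)
FN = sum(1 for yi, di in zip(y, y_dot) if yi == 1 and di == 0)
print(TP, FP, FN)  # 1 1 1
```

Note that the counts jump discontinuously as $t$ crosses a posterior value, which is exactly why this form cannot be differentiated and optimized directly.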
$$ \begin{equation} \begin{split} TP(\hat{Y}, Y) &= \sum\nolimits_i \hat{y}_i \cdot y_i, \\ FP(\hat{Y}, Y) &= \sum\nolimits_i \hat{y}_i \cdot (1 - y_i), \\ FN(\hat{Y}, Y) &= \sum\nolimits_i (1-\hat{y}_i) \cdot y_i. \end{split} \label{eq:tpfp} \end{equation} $$Given the definitions in Eq.\ref{eq:tpfp}, precision $p$ and recall $r$ are:
$$ \begin{equation} \begin{split} p(\hat{Y}, Y) &= \frac{TP}{TP + FP} \\ r(\hat{Y}, Y) &= \frac{TP}{TP + FN}. \end{split} \label{pr} \end{equation} $$Finally our relaxed F-measure can be written as:
$$ \begin{equation} \begin{split} F(\hat{Y}, Y) &= \frac{(1+\beta^2) p \cdot r}{\beta^2 p + r} \\ &= \frac{(1 + \beta^2)TP}{\beta^2(TP + FN) + (TP + FP)} \\ &= \frac{(1 + \beta^2)TP}{H}, \end{split} \label{f} \end{equation} $$where $H\! =\! \beta^2(TP + FN) + (TP + FP)$. Due to the relaxation in Eq.\ref{eq:tpfp}, Eq.\ref{f} is decomposable w.r.t the posterior $\hat{Y}$ and can therefore be integrated into a CNN architecture trained with back-propagation.
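On the same kind of toy input (made-up values, with $\beta^2 = 0.3$ as is conventional for SOD), the relaxed counts of Eq.\ref{eq:tpfp} can be checked numerically, and the precision-recall form and the TP-based form of Eq.\ref{f} agree:

```python
# Toy check of the relaxed TP/FP/FN (Eq. 2) and both forms of F (Eq. 4).
# y and y_hat are made-up example values; beta2 plays the role of beta^2.
y = [1, 1, 0, 0]
y_hat = [0.8, 0.4, 0.6, 0.2]
beta2 = 0.3

TP = sum(p * t for p, t in zip(y_hat, y))         # 0.8 + 0.4 = 1.2
FP = sum(p * (1 - t) for p, t in zip(y_hat, y))   # 0.6 + 0.2 = 0.8
FN = sum((1 - p) * t for p, t in zip(y_hat, y))   # 0.2 + 0.6 = 0.8

prec = TP / (TP + FP)   # precision (Eq. 3) = 0.6
rec = TP / (TP + FN)    # recall (Eq. 3)    = 0.6
F_pr = (1 + beta2) * prec * rec / (beta2 * prec + rec)
F_tp = (1 + beta2) * TP / (beta2 * (TP + FN) + (TP + FP))
print(F_pr, F_tp)  # both equal 0.6
```

Unlike the thresholded counts, every quantity here varies smoothly with $\hat{y}_i$, so gradients flow to every pixel.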
Surface plots of different loss functions in a 2-point, 2-class classification setting. Left: FLoss; middle: Log-FLoss; right: cross-entropy loss. In the top row the ground-truth is [0, 1]; in the bottom row it is [1, 1]. Compared with the cross-entropy loss (and Log-FLoss), FLoss holds considerable gradients even in the saturated area, leading to polarized predictions.
Salient object detection examples on several popular datasets. F-DHS, F-Amulet and F-DSS indicate the original architectures trained with our proposed FLoss. FLoss leads to sharp saliency confidence, especially on object boundaries.
Quantitative comparisons with competitor methods.
Precision, recall, F-measure and maximal F-measure (·) of DSS (- - -) and F-DSS (---) under different thresholds. DSS tends to predict unknown pixels as the majority class (the background), resulting in high precision but low recall. FLoss is able to find a better compromise between precision and recall.
FLoss (solid lines) achieves high F-measure under a larger range of thresholds, presenting stability against the changing of threshold.
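The threshold-stability claim can be illustrated with a small made-up example: a polarized prediction keeps its F-measure over a wide range of thresholds, while a soft, unpolarized prediction does not. All values below are illustrative, not results from the paper:

```python
# Made-up example: F-measure vs. binarization threshold for a polarized
# prediction and a soft one, using the standard thresholded F (beta^2 = 0.3).
def f_at_threshold(y_hat, y, t, beta2=0.3):
    d = [1 if p > t else 0 for p in y_hat]  # binarize at threshold t
    TP = sum(1 for yi, di in zip(y, d) if yi == 1 and di == 1)
    FP = sum(1 for yi, di in zip(y, d) if yi == 0 and di == 1)
    FN = sum(1 for yi, di in zip(y, d) if yi == 1 and di == 0)
    H = beta2 * (TP + FN) + (TP + FP)
    return (1 + beta2) * TP / H if H > 0 else 0.0

y = [1, 1, 1, 0, 0, 0]
polarized = [0.95, 0.90, 0.92, 0.05, 0.10, 0.08]  # confidences pushed to 0/1
soft = [0.60, 0.55, 0.70, 0.45, 0.40, 0.50]       # confidences near 0.5

thresholds = [0.2, 0.35, 0.5, 0.65, 0.8]
f_pol = [f_at_threshold(polarized, y, t) for t in thresholds]
f_soft = [f_at_threshold(soft, y, t) for t in thresholds]
print(f_pol)   # 1.0 at every threshold
print(f_soft)  # varies strongly with the threshold
```

The polarized prediction scores the same F-measure at every threshold in the range, so no per-dataset threshold tuning is needed; the soft prediction only peaks at one carefully chosen threshold.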
@InProceedings{zhao2019optimizing,
author = {Kai Zhao and Shanghua Gao and Wenguan Wang and Ming-Ming Cheng},
title = {Optimizing the {F}-measure for Threshold-free Salient Object Detection},
booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
month = {Oct},
year = {2019},
url = {http://kaizhao.net/fmeasure},
}
import torch

def floss(prediction, target, beta=0.3, log_like=False):
    """FLoss = 1 - relaxed F-measure (Eq. 4); `beta` here denotes beta^2 in the paper."""
    EPS = 1e-10
    N = prediction.size(0)
    # TP = sum(y_hat * y); H = beta^2 * sum(y) + sum(y_hat)  (Eqs. 2 and 4)
    TP = (prediction * target).view(N, -1).sum(dim=1)
    H = beta * target.view(N, -1).sum(dim=1) + prediction.view(N, -1).sum(dim=1)
    fmeasure = (1 + beta) * TP / (H + EPS)
    if log_like:
        # Log-FLoss variant: -log F (see the surface plots above)
        floss = -torch.log(fmeasure + EPS)
    else:
        floss = 1 - fmeasure
    return floss  # per-sample loss; reduce with .mean() if a scalar is needed
import torch
from torch import nn

class FLoss(nn.Module):
    """Module wrapper around FLoss; `beta` here denotes beta^2 in the paper."""

    def __init__(self, beta=0.3, log_like=False):
        super(FLoss, self).__init__()
        self.beta = beta
        self.log_like = log_like

    def forward(self, prediction, target):
        EPS = 1e-10
        N = prediction.size(0)
        # TP = sum(y_hat * y); H = beta^2 * sum(y) + sum(y_hat)  (Eqs. 2 and 4)
        TP = (prediction * target).view(N, -1).sum(dim=1)
        H = self.beta * target.view(N, -1).sum(dim=1) + prediction.view(N, -1).sum(dim=1)
        fmeasure = (1 + self.beta) * TP / (H + EPS)
        if self.log_like:
            floss = -torch.log(fmeasure + EPS)
        else:
            floss = 1 - fmeasure
        return floss  # per-sample loss; reduce with .mean() if a scalar is needed