# A Small Difference Between the SGD Implementations in PyTorch and Caffe


$$v_{t+1} = \mu v_t - \varepsilon \Delta f(\theta_t) \tag{1}$$
$$\begin{split} \theta_{t+1} &= \theta_t + v_{t+1}\\ & = \theta_t + \mu \cdot v_t - \varepsilon \cdot \Delta f(\theta_t) \end{split} \tag{2}$$

In Eq. (1), $\Delta f(\theta_t)$ denotes the gradient of the objective function, $\mu$ is the momentum coefficient, $v$ is the accumulated update (called the *velocity* in [1]), and $\varepsilon$ is the learning rate.
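As a quick illustration (my own sketch, not from the original post), Eqs. (1)–(2) in plain Python:

```python
def momentum_step(theta, v, grad, mu=0.9, eps=0.1):
    """One step of classical momentum, following Eqs. (1)-(2):
    v_{t+1}     = mu * v_t - eps * grad
    theta_{t+1} = theta_t + v_{t+1}
    """
    v = mu * v - eps * grad
    return theta + v, v

# Starting from theta = 1.0, v = 0.0 with grad = 2.0:
theta, v = momentum_step(1.0, 0.0, 2.0)
```

Note that the learning rate $\varepsilon$ scales only the fresh gradient before it is folded into the velocity; this detail is where the two frameworks will differ below.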

Caffe implements this update in `SGDSolver<Dtype>::ComputeUpdateValue`:

```cpp
template <typename Dtype>
void SGDSolver<Dtype>::ComputeUpdateValue(int param_id, Dtype rate) {
  const vector<Blob<Dtype>*>& net_params = this->net_->learnable_params();
  const vector<float>& net_params_lr = this->net_->params_lr();
  Dtype momentum = this->param_.momentum();
  Dtype local_rate = rate * net_params_lr[param_id];
  // Compute the update to history, then copy it to the parameter diff.
  switch (Caffe::mode()) {
  case Caffe::CPU: {
    caffe_cpu_axpby(net_params[param_id]->count(), local_rate,
        net_params[param_id]->cpu_diff(), momentum,
        history_[param_id]->mutable_cpu_data());
    caffe_copy(net_params[param_id]->count(),
        history_[param_id]->cpu_data(),
        net_params[param_id]->mutable_cpu_diff());
    break;
  }
  case Caffe::GPU: {
#ifndef CPU_ONLY
    sgd_update_gpu(net_params[param_id]->count(),
        net_params[param_id]->mutable_gpu_diff(),
        history_[param_id]->mutable_gpu_data(),
        momentum, local_rate);
#else
    NO_GPU;
#endif
    break;
  }
  default:
    LOG(FATAL) << "Unknown caffe mode: " << Caffe::mode();
  }
}
```


The key call is `caffe_cpu_axpby`: its signature `caffe_cpu_axpby(N, alpha, X, beta, Y)` computes `Y = alpha * X + beta * Y`, so the CPU branch updates the history (i.e. the velocity) as `history = local_rate * diff + momentum * history`:

```cpp
caffe_cpu_axpby(net_params[param_id]->count(), local_rate,
    net_params[param_id]->cpu_diff(), momentum,
    history_[param_id]->mutable_cpu_data());
```
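In other words (a hedged scalar sketch of my own, not Caffe code), one Caffe SGD step amounts to:

```python
def caffe_sgd_step(param, history, grad, local_rate, momentum):
    """Scalar sketch of Caffe's update:
    history = local_rate * grad + momentum * history   # caffe_cpu_axpby
    diff    = history                                  # caffe_copy
    param   = param - diff                             # applied by the solver
    """
    history = local_rate * grad + momentum * history
    return param - history, history
```

Here the learning rate multiplies only the fresh gradient before it enters the history, which matches the Eq. (1) form.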


PyTorch's counterpart is the `step` method of `torch.optim.SGD`:

```python
def step(self, closure=None):
    """Performs a single optimization step.

    Arguments:
        closure (callable, optional): A closure that reevaluates the model
            and returns the loss.
    """
    loss = None
    if closure is not None:
        loss = closure()

    for group in self.param_groups:
        weight_decay = group['weight_decay']
        momentum = group['momentum']
        dampening = group['dampening']
        nesterov = group['nesterov']

        for p in group['params']:
            if p.grad is None:
                continue
            d_p = p.grad.data
            if weight_decay != 0:
                d_p.add_(weight_decay, p.data)
            if momentum != 0:
                param_state = self.state[p]
                if 'momentum_buffer' not in param_state:
                    buf = param_state['momentum_buffer'] = torch.zeros_like(p.data)
                    buf.mul_(momentum).add_(d_p)
                else:
                    buf = param_state['momentum_buffer']
                    buf.mul_(momentum).add_(1 - dampening, d_p)
                if nesterov:
                    d_p = d_p.add(momentum, buf)
                else:
                    d_p = buf

            p.data.add_(-group['lr'], d_p)

    return loss
```


Reading off the `momentum != 0` branch (taking `weight_decay = 0`, `dampening = 0`, and `nesterov = False`), the update PyTorch performs is:
$$v_{t+1} = \mu v_t + \Delta f(\theta_t) \tag{3}$$
$$\begin{split} \theta_{t+1} &= \theta_t - \varepsilon v_{t+1}\\ & = \theta_t - \varepsilon \mu \cdot v_t - \varepsilon \cdot \Delta f(\theta_t) \end{split} \tag{4}$$

For ease of comparison, we repeat Eqs. (1) and (2):
$$v_{t+1} = \mu v_t - \varepsilon \Delta f(\theta_t) \tag{1}$$
$$\begin{split} \theta_{t+1} &= \theta_t + v_{t+1}\\ & = \theta_t + \mu \cdot v_t - \varepsilon \cdot \Delta f(\theta_t) \end{split} \tag{2}$$

The difference: in Caffe the learning rate $\varepsilon$ scales only the fresh gradient as it enters the velocity, while in PyTorch $\varepsilon$ scales the entire velocity, momentum term included. The two rules coincide as long as $\varepsilon$ stays constant, but behave differently the moment the learning rate changes during training (see the discussion of learning-rate schedules in [2]).
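The consequence can be checked numerically with a small sketch of my own (toy objective $f(\theta)=\theta^2/2$, so the gradient is $\theta$ itself): with a constant learning rate the two rules trace identical parameter trajectories, but they diverge once the learning rate drops mid-run.

```python
def run(update, lrs, theta=1.0):
    """Run a sequence of updates on f(theta) = theta**2 / 2 (grad = theta)."""
    v = 0.0
    for lr in lrs:
        theta, v = update(theta, v, grad=theta, lr=lr)
    return theta

def caffe_style(theta, v, grad, lr, mu=0.9):
    # Learning rate folded into the velocity (Caffe's rule).
    v = mu * v - lr * grad
    return theta + v, v

def pytorch_style(theta, v, grad, lr, mu=0.9):
    # Learning rate applied outside the velocity (PyTorch's rule).
    v = mu * v + grad
    return theta - lr * v, v

constant = [0.1] * 6
decayed = [0.1] * 3 + [0.01] * 3

# Identical (up to rounding) with a constant learning rate...
same = run(caffe_style, constant) - run(pytorch_style, constant)
# ...but clearly different once the learning rate drops mid-run.
diff = run(caffe_style, decayed) - run(pytorch_style, decayed)
```

Intuitively, after the drop Caffe's velocity still carries momentum accumulated at the old, larger learning rate, whereas PyTorch immediately rescales the whole velocity by the new one.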

[1] Sutskever, Ilya, et al. “On the importance of initialization and momentum in deep learning.” International Conference on Machine Learning. 2013.

[2] Goyal, Priya, et al. “Accurate, large minibatch SGD: training ImageNet in 1 hour.” arXiv preprint arXiv:1706.02677 (2017).