Stable Models
Since we work with overparameterized models, there is a whole continuous manifold of minima. To single out a unique minimum, we focus on stable models.
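To make the manifold concrete, here is a minimal NumPy sketch (my illustration, not from the lecture): a linear model with more weights than training points fits the data exactly along a whole continuous family of solutions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Overparameterized linear model: 3 weights but only 2 training points.
X = rng.normal(size=(2, 3))
y = rng.normal(size=2)

w0 = np.linalg.pinv(X) @ y        # one exact solution (minimum-norm)

# Every direction in the null space of X keeps the loss at zero, so the
# minima form a continuous (here one-dimensional) manifold.
null_dir = np.linalg.svd(X)[2][-1]
for alpha in [0.0, 1.0, 5.0]:
    w = w0 + alpha * null_dir
    print(np.mean((X @ w - y) ** 2))   # ~0 for every alpha: all are minima
```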
A model is stable if removing a single data point from the training set has little impact on the overall distribution of the gradients.
In a local minimum the per-sample gradients always form a balanced force field: the individual gradients cancel, so the total gradient vanishes. If the field were not balanced, we could still improve by moving along the net gradient and would therefore not be in a local minimum.
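This picture can be checked numerically. A sketch (the toy least-squares model and all names are my own assumptions): at the minimum the per-sample gradients sum to zero although each is individually nonzero, and dropping one sample leaves a net force, which connects the force-field picture to the stability definition above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy least squares: 50 samples, 3 weights, so the minimum does not
# fit every sample exactly and the per-sample gradients are nonzero.
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=50)
w_star = np.linalg.lstsq(X, y, rcond=None)[0]

residuals = X @ w_star - y
grads = residuals[:, None] * X                 # per-sample gradients, shape (50, 3)

print(np.linalg.norm(grads.mean(axis=0)))      # ~0: the force field is balanced
print(np.linalg.norm(grads, axis=1).mean())    # the individual forces are not zero

# Removing one sample unbalances the field: the remaining gradients no
# longer cancel, so the minimum of the reduced training set moves.
print(np.linalg.norm(np.delete(grads, 0, axis=0).mean(axis=0)))
```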
Doing a Taylor expansion around such a stable point, pattern-by-pattern learning is structurally equivalent to minimizing the extended error

$$\tilde{E}_{\text{pbp}} = \frac{1}{T}\sum_{t=1}^{T} E_t(w) + \frac{\eta^2}{2}\sum_{k}\operatorname{Var}_t\!\left(g_{t,k}\right)\frac{\partial^2 E}{\partial w_k^2}, \qquad g_{t,k} = \frac{\partial E_t}{\partial w_k}.$$

Here we have a smoothness penalty, the curvature $\partial^2 E / \partial w_k^2$, weighted by the variance of the gradient $g_{t,k}$ for sample $t$ and weight $w_k$. So with a high-variance gradient we enforce smoother (flatter) weights, such that the next gradients will have less variance.
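As a sketch of what the two ingredients of this penalty measure (my own illustration on the toy problem above, not the lecture's), one can evaluate the per-weight gradient variance and the diagonal curvature directly:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=50)
w_star = np.linalg.lstsq(X, y, rcond=None)[0]   # a stable minimum of the mean error

eta = 0.05
residuals = X @ w_star - y
grads = residuals[:, None] * X        # g_{t,k}: gradient for sample t, weight k

grad_var = grads.var(axis=0)          # Var_t(g_{t,k}) per weight
curvature = (X**2).mean(axis=0)       # diagonal curvature d^2E/dw_k^2 of the mean error

E = 0.5 * np.mean(residuals**2)
penalty = 0.5 * eta**2 * np.sum(grad_var * curvature)
print(E, penalty)                     # mean error and its implicit smoothness penalty
```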
The same can be done for Vario-Eta learning, where each weight gets the learning rate $\eta_k = \eta / \sqrt{\operatorname{Var}_t(g_{t,k})}$. This normalization cancels the variance weighting, leaving

$$\tilde{E}_{\text{vario}} = \frac{1}{T}\sum_{t=1}^{T} E_t(w) + \frac{\eta^2}{2}\sum_{k}\frac{\partial^2 E}{\partial w_k^2},$$

where we have a global, unconditional penalty: the smoothness of every weight is penalized equally, independent of its gradient variance.
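A minimal sketch of a Vario-Eta update (assuming the standard per-weight normalization; the exponential smoothing of the variance estimate, the constants, and the toy problem are my choices, not the lecture's):

```python
import numpy as np

def vario_eta_step(w, grad, mean_est, var_est, eta=0.01, beta=0.9, eps=1e-8):
    # Running estimates of the per-weight gradient mean and variance.
    mean_est = beta * mean_est + (1 - beta) * grad
    var_est = beta * var_est + (1 - beta) * (grad - mean_est) ** 2
    # Vario-eta: per-weight learning rate eta normalized by the gradient's std.
    w = w - eta * grad / (np.sqrt(var_est) + eps)
    return w, mean_est, var_est

# Usage on a small noisy least-squares problem:
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=50)

w = np.zeros(3)
mean_est, var_est = np.zeros(3), np.ones(3)
for epoch in range(100):
    for t in rng.permutation(len(X)):       # pattern-by-pattern ordering
        grad = (X[t] @ w - y[t]) * X[t]     # gradient of 0.5 * (x_t @ w - y_t)^2
        w, mean_est, var_est = vario_eta_step(w, grad, mean_est, var_est)
print(w)                                     # hovers near the least-squares solution
```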