Optimization in Machine Learning: Robust or global minimum?

Here we discuss how convex problems are solved and optimised in machine learning/deep learning.

By Nikolaos Vasiloglou.

Optimization in Machine Learning

In 2014 the Facebook best paper award @UAI was awarded to “Universal Convexification via Risk-Aversion” b.y Krishnamurthy Dvijotham; Maryam Fazel; Emanuel Todorov http://www.auai.org/uai2014/proceedings/individuals/121.pdf. Three years later we follow up with the authors in a brief interview:

NV: First of all congratulations for the award. We understand that in convex problems it is much easier to find the global optimum.

KD, MF: Thank you! We appreciate the opportunity to participate in this discussion.

1. NV: Does the convexified problem always have the same minimum with the original problem?

KD, MF: No, the convexified problem can have a minimum that is quite different from the original problem. The motivation for our paper comes from the fact that in many problems (like control and reinforcement learning) one is interested in a “robust” minimum (a minimum such that the cost does not increase much when you perturb the parameters). Our method destroys non-robust minima and preserves a single robust minimum of the problem.

2. NV: We have seen a lot of interest in analyzing non-convex problems that have all their local minimum equivalent to the global minimum. For example A. Anandkumar’s group has been working on how to guarantee minimums in nonConvex problems. How does your work compare to that direction?

KD: Yes, the results along these lines are very interesting and have potential connections with our work. In particular, the papers https://arxiv.org/abs/1503.03712, https://arxiv.org/pdf/1610.09322.pdf study the approach of smoothing nonconvex functions to eliminate local minima, but do this in a gradual fashion (starting with a large amount of smoothing and gradually reduce it while tracking the global minimum). Our approach does this in a one-shot way (adds a large amount of noise/risk-aversion) to directly convexify the problem, but this can be improved (in the sense of allowing weaker assumptions to guarantee location of a robust minimum) potentially by doing this in a graduated fashion. We are excited to further explore these connections.

MF: Just to add that our work (as well as the papers KD mentioned above) are not focused on the restricted case where all local minima are equivalent to the global minimum.

3. NV:  The biggest headache is with Deep Learning. You have an example with MNIST on your paper. Have you tried it in other cases?

KD: The results we had on MNIST were quite preliminary and definitely need more careful investigation and comparison with the latest deep learning algorithms . Unfortunately, we did not have a chance to follow up with more detailed experiments – I left for a postdoc shortly after we wrote this paper and shifted focus to other topics.  We plan to get back to this over the next few months and write an extended version of this paper. However, several papers by other authors have appeared in the literature that explore similar ideas of adding noise to neural networks to improve optimization and generalization (for example, https://openreview.net/forum?id=B1YfAfcgl).

4. NV: The deep learning enthusiasts say that SGD works most of the times and with a couple of restarts you train your network. How do you comment on that? Why do you think your method is still better?

KD: The contribution of our paper is a specific approach that quantifies precisely how much noise/risk-aversion one needs to convexify the problem. We do not expect the approach to compete directly with SGD at least for standard supervised learning problems, although it may offer some insights on how much noise to add so that the training is less susceptible to local minima (MF: Recently, noisy gradient descent and Langevin dynamics have received attention for a similar goal).

However, our main motivation for developing this was in RL and control problems where robustness is inherently a desirable quality (for example it is a way to ensure that a robotic system will not violate safety constraints in spite of perturbations and disturbances). Our intuition was that finding robust minima is actually easier and the theory developed in the paper provides some rigorous justification for this statement.

6. NV: How was your paper received from the community? Were there any follow ups? Your method seems very straightforward to implement. Do you think that a tensorFlow implementation (or any other popular package) would help people adopt it more?

KD: We have received some followup from recent papers applying similar ideas to computer vision (https://link.springer.com/chapter/10.1007/978-3-319-14612-6_4) and general nonconvex optimization (https://arxiv.org/abs/1602.02191). However, since we did not continue to work on this ourselves, the paper was not well publicized. We hope to get back to it soon, improve the algorithm to make it practical and publish an open source implementation. The basic method in the paper is straightforward to implement. However, to achieve good performance in practice, we believe that the approach needs to be combined with variance reduction ideas developed by researchers working on SGD in recent years. As I said in the previous question, our view is that the method is most appropriate for problems where robustness is of direct interest as opposed to standard supervised learning problems. We hope to develop an implementation of the approach suitable to RL problems using the OpenAIgym environment (https://gym.openai.com/) over the next few months and make this publicly available.

7. NV: Tell us about other successful approaches in the problem of finding the global minimum in practice.

KD: Global optimization of nonconvex problems is challenging and difficult in general. In the optimization community, several approaches have been developed that rely on branch and bound, i.e., constructing and refining convex relaxations combined with some kind of search strategy (see, for example BARON: http://archimedes.cheme.cmu.edu/?q=baron). For problems with no additional structure, these approaches are the best one can hope for. However, the performance of these algorithms on any specific problem can vary widely (in the worst case they are doing something akin to a brute force search). For the case of polynomial optimization problems, the Lasserre hierarchy (http://homepages.laas.fr/lasserre/book-flyer.pdf) allows the computation of global optima via a hierarchy of semidefinite programs growing in size. In theory, one needs to solve very large semidefinite programs to find the global optimum, but in practice, for several instances of polynomial optimization problems, a reasonably sized semidefinite program suffices to compute the global optimum.

8. NV: Is it worth looking for convex envelopes for nonconvex problems? How much slower would your algorithm would be compared to the convex envelope?

KD: This is a great question. The problem of constructing the convex envelope of a general nonconvex function is again NP-hard and difficult except in special cases. In the very special cases where constructing a convex envelope is easy, one can minimize this envelope if one is interested in finding the minimum of the original objective function. As stated earlier, however, the point of our approach is to find a robust minimum rather than the global minimum and it is not obvious that the convex envelope will be useful for this goal. It may also be possible to interpret our approach in terms of a different kind of envelope known as the Moreau envelope; we are currently investigating these connections.

9. NV: In your experiments you emphasize more on the generalization error and less on the proximity to the global minimum.  Why is that? Does your method actually solve both problems?

KD: Our approach can produce an optimum that is far from the global minimum since it tries to find a robust minimum. The hypothesis that we tested in the experiments was that perhaps the robust minimum is better than the global minimum in terms of generalization error – the reason to expect this is that in the experiments, we added noise to the prediction of the SVM (w’x+b) and the intermediate neurons of a neural network. The idea was that adding noise of this kind and trying to minimize the expected learning loss over the noise  tries to ensure that the prediction of the algorithm remains stable not just for a given input but also for a range of noisy perturbations around it. Of course, this was not really justified rigorously in our paper, but the preliminary experiments we had indicated that this was indeed the case. We are investigating ways to formalize this; it has been known that stability of learning algorithms to perturbations is connected to generalization ability http://www.jmlr.org/papers/volume2/bousquet02a/bousquet02a.pdf).

10. NV: Your method adds noise to the input and optimizes over the expectation? Is there a link with bayesian methods? Any connections to accelerated Monte Carlo variants? Also does your method provide a distribution for the learned model?

KD: There is no direct link since the noise in our approach is injected on purpose and does not come from a probabilistic model as in the Bayesian approach . Also, at the end, we do not obtain a posterior distribution over the model parameters as in a Bayesian approach – only a point estimate. There may be ways to leverage Monte Carlo algorithms to obtain a better estimate of the stochastic gradient sample but we have not studied that yet. Our method only produces a single estimate of the model parameters, not an entire distribution.

11. NV:  Are there any connections of your method with dropout (with regards to improving generalization)?

KD,MF: We are not aware of any direct connections.

12. NV: The best paper at ICLR 2017 “Understanding Deep Learning required Rethinking Generalization” ends with the following phrase: “Another insight resulting from our experiments is that optimization continues to be empirically easy even if the resulting model does not generalize. This shows that the reasons for why optimization is empirically easy must be different from the true cause of generalization”. You seem to treat both (optimization and generalization) at the same time. Does your approach contradict the findings of this paper?

KD: I have not studied this paper carefully, but it does have an interesting observation. As I mentioned earlier, our approach does seem to indicate that finding robust minima is easier than finding the global and there is some evidence (from our experiments and other learning theory papers) that robustness and generalization are connected. However, our experiments were fairly preliminary and used very small data sets and neural networks. Recent work in deep learning has shown that optimization issues (local minima etc.) tend to go away in the realm of big data and large neural networks – thus the situation studied in the recent ICLR paper may actually present a different regime where both optimization and generalization have different qualitative properties than a setup with a small number of model parameters and small datasets. I would not say that the two studies contradict each other but the new ICLR work is intriguing since it goes against the conventional wisdom on learning and generalization.

Original. Reposted with permission.

Bio: Nikolaos Vasiloglou is Organizing Committee Member at UAI 2017 and has several years experience in building/developing distributed machine learning systems.