
300 Years Later, Newton's Method gets an Update

Jeffrey Zhang, Yale BIDS

Joint work with Amir Ali Ahmadi, Abraar Chaudry

AAMU-UAH Joint Seminar, 4/18/2025


Newton’s Method

Newton’s method is an iterative optimization algorithm that minimizes the second-order Taylor expansion of the objective at each iteration


The Newton iterate

  • $x_{k+1} = x_k - \left[\nabla^2 f(x_k)\right]^{-1} \nabla f(x_k)$, the unique critical point of the second-order Taylor expansion of $f$ at $x_k$ (its minimizer when $\nabla^2 f(x_k) \succ 0$)
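For concreteness, here is a minimal NumPy sketch of the classical iteration (my own illustration, not code from the talk):

import numpy as np

def newton_method(grad, hess, x0, tol=1e-10, max_iter=50):
    # Classical Newton's method: repeatedly jump to the critical point of the
    # local quadratic model of the objective.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Newton step: solve the linear system rather than forming the inverse
        x = x - np.linalg.solve(hess(x), g)
    return x

# Example: minimize f(x, y) = (x - 1)^2 + 10 * (y + 2)^2 (one step suffices)
grad = lambda v: np.array([2 * (v[0] - 1), 20 * (v[1] + 2)])
hess = lambda v: np.array([[2.0, 0.0], [0.0, 20.0]])
print(newton_method(grad, hess, [5.0, 5.0]))   # -> [ 1. -2.]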


Convergence of Newton’s Method

  • Classical result in optimization: Newton’s method has quadratic convergence
  • Formally,

Theorem

If $\nabla^2 f$ is Lipschitz continuous in a neighborhood of a local minimizer $x^*$ with $\nabla^2 f(x^*) \succ 0$, then for any starting point sufficiently close to $x^*$, the Newton iterates satisfy $\|x_{k+1} - x^*\| \le C\, \|x_k - x^*\|^2$ for some constant $C > 0$.


Newton’s method doesn’t always work!

Functions with S-shaped gradients are difficult for Newton’s method
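A classic illustration (my own example; the function pictured on the slide may differ): take $f$ strictly convex with the S-shaped gradient $f'(x) = \arctan(x)$, so $f''(x) = 1/(1+x^2)$ and the unique minimizer is $0$. The Newton step overshoots whenever the iterate is too far from $0$, and the method diverges:

import numpy as np

# Newton step for f'(x) = arctan(x):  x_{k+1} = x_k - f'(x_k) / f''(x_k)
#                                           = x_k - (1 + x_k^2) * arctan(x_k)
x = 2.0                       # far enough from the minimizer that Newton overshoots
for k in range(6):
    x = x - (1 + x**2) * np.arctan(x)
    print(k, x)
# The iterates alternate in sign and blow up; starting from x = 0.5 instead,
# the same recursion converges to 0 quadratically.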


Higher-order information in Newton’s method

Using higher-order derivative information can improve Newton’s method by building more accurate local approximations of the objective

Empirically, using higher-order information can lead to faster, more stable algorithms
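One standard way to make this precise (the talk’s exact notation may differ): a $d$-th order Newton method replaces the quadratic model with the degree-$d$ Taylor expansion,

$$x_{k+1} \in \arg\min_{x} \; T_{f,d}(x; x_k), \qquad T_{f,d}(x; x_k) := \sum_{j=0}^{d} \frac{1}{j!}\, \nabla^j f(x_k)[\underbrace{x - x_k, \ldots, x - x_k}_{j \text{ times}}],$$

which recovers the classical Newton step when $d = 2$.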


Why haven’t we done this?

Optimizing high-degree polynomials is hard!

  • Implementing these methods requires optimizing higher-degree polynomials

  • Informal theorem: most polynomial optimization problems become NP-hard once the degree reaches 4

Some NP-hard problems:

  • Globally minimizing a polynomial of degree 4
  • Locally minimizing a polynomial of degree 4
  • Finding a second-order point of a polynomial of degree 4
  • Finding a critical point of a polynomial of degree 3


A bridge to higher-order methods

If there were efficient algorithms for minimizing these higher-degree approximations, they would immediately yield higher-order Newton methods

Example: a third-order Newton method is possible!

Theorem (Silina, JZ ’22)

Theorem (Ahmadi, JZ ’22)

Local minima of cubic polynomials can be found by solving SDPs of size polynomial in the number of variables.


A different higher-order method

  •  

 


Outline

  • An Introduction to Sums of Squares

  • Convergence of Higher Order Newton Methods

  • Numerical Examples


Sums of Squares

  • A polynomial $p$ is a sum of squares (sos) if it can be written as $p(x) = \sum_i q_i(x)^2$ for some polynomials $q_i$
  • Equivalently, $p$ of degree $2d$ is sos if and only if $p(x) = z(x)^T Q\, z(x)$ for some positive semidefinite matrix $Q$, where $z(x)$ is the vector of monomials of degree at most $d$; this can be checked with a semidefinite program (SDP)
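A classical caveat worth keeping in mind (standard example, not from the slides): not every nonnegative polynomial is sos. The Motzkin polynomial $x^4 y^2 + x^2 y^4 - 3 x^2 y^2 + 1$ is nonnegative on all of $\mathbb{R}^2$ but is not a sum of squares, which is why sos generally gives a relaxation rather than an exact reformulation.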


Sum of Squares and Polynomial Optimization

  • Sum of squares relaxations give computable lower bounds for polynomial optimization problems:

$$p^* := \min_{x \in \mathbb{R}^n} p(x) \;\;\ge\;\; p_{\mathrm{sos}} := \max_{\gamma \in \mathbb{R}} \left\{ \gamma \;:\; p(x) - \gamma \ \text{is sos} \right\},$$

and $p_{\mathrm{sos}}$ can be computed by solving an SDP

  • There are extensions for constrained optimization, but they won’t be needed here
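A minimal sketch of the unconstrained sos lower bound above for a univariate quartic, using cvxpy (my own illustration; the tooling and the example polynomial are assumptions, not from the talk):

import numpy as np
import cvxpy as cp

# p(x) = x^4 - 3x^2 + 1, coefficients by increasing power; its global minimum is -1.25
p = np.array([1.0, 0.0, -3.0, 0.0, 1.0])
d = 2                                    # half-degree; monomial basis z(x) = (1, x, x^2)

Q = cp.Variable((d + 1, d + 1), symmetric=True)   # Gram matrix of the sos certificate
gamma = cp.Variable()                             # the lower bound being maximized

# Match coefficients of p(x) - gamma with z(x)^T Q z(x) = sum_{i,j} Q_{ij} x^{i+j}
constraints = [Q >> 0]
for k in range(2 * d + 1):
    gram_coeff = sum(Q[i, k - i] for i in range(max(0, k - d), min(k, d) + 1))
    target = p[k] - gamma if k == 0 else p[k]
    constraints.append(gram_coeff == target)

cp.Problem(cp.Maximize(gamma), constraints).solve()
print(gamma.value)   # ~ -1.25; exact here, since nonnegative univariate polynomials are sos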


Sum of Squares Convexity

  •  

Definition

A polynomial $f$ is sos-convex if the polynomial $(x, y) \mapsto y^T \nabla^2 f(x)\, y$ is a sum of squares (equivalently, $\nabla^2 f(x) = M(x)^T M(x)$ for some matrix $M(x)$ with polynomial entries). Sos-convexity implies convexity and can be checked with an SDP.
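A small worked example (mine, not from the slides): $f(x_1, x_2) = x_1^4 + x_2^2$ is sos-convex, since

$$y^T \nabla^2 f(x)\, y \;=\; 12\, x_1^2 y_1^2 + 2\, y_2^2 \;=\; \left(2\sqrt{3}\, x_1 y_1\right)^2 + \left(\sqrt{2}\, y_2\right)^2,$$

a sum of squares in $(x, y)$.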


SoS Convex Polynomials and SoS Relaxations

  • Key theorem (by Lasserre):

Theorem (Lasserre, ’09)

If the objective and constraint polynomials of a polynomial optimization problem (POP) are sos-convex, then the first level of the Lasserre hierarchy is exact: its optimal value equals the optimal value of the POP.

POP:

1st order Lasserre relaxation:

  • All we need here is the unconstrained version: if $p$ is sos-convex, then $\min_x p(x) = \max\{\gamma : p(x) - \gamma \ \text{is sos}\}$


Outline

  • An Introduction to Sums of Squares

  • Convergence of Higher Order Newton Methods

  • Numerical Examples


Our Newton Algorithm

Main idea: Leverage sos convexity to produce an implementable algorithm
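A schematic of one iteration, pieced together from this slide and the regularization discussion at the end of the talk (the precise regularizer and constants are specified in the paper and are not reproduced here; treat the form below as an assumption for illustration):

$$x_{k+1} \in \arg\min_{x} \; T_{f,d}(x; x_k) \;+\; \lambda_k\, \rho(x - x_k),$$

where $T_{f,d}$ is the degree-$d$ Taylor expansion from before, $\rho$ is a fixed even-degree polynomial regularizer (for example $\|x - x_k\|^d$ with $d$ even), and $\lambda_k$ is chosen via an SDP so that the regularized model is sos-convex. That sos-convex model can then be minimized exactly by a single SDP, using the Lasserre result above.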

 

Theorem (Ahmadi, Chaudry, JZ ’22)


To-do checklist

  •  


An SoS-convex approximation exists

Lemma

 

Lemma

A consequence of a more general statement about cones


And it has a unique global minimum

Lemma

Proof (of lemma) sketch:

  1. Use the lemma to lower bound the (sos-convex) polynomial by a coercive quadratic function
  2. The polynomial must then be coercive, so it attains a global minimum
  3. If it had multiple global minima, then, being convex, it would be constant on the segment between them, hence on the entire line through them, contradicting coercivity

Lemma

Proof idea: use a quadrature rule due to Clenshaw and Curtis

Interpretation: when the Hessian of a function is integrated along a line segment lying in a region where the function is convex, the smallest eigenvalue of the resulting matrix can be bounded below in terms of the Hessians at the endpoints of the segment.
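In symbols, the kind of bound this gives (my paraphrase; the exact weights depend on the quadrature rule and the degree of $f$): if the polynomial $f$ is convex on the segment $[a, b]$, then

$$\int_0^1 \nabla^2 f\big(a + t(b - a)\big)\, dt \;\succeq\; w \left(\nabla^2 f(a) + \nabla^2 f(b)\right) \quad \text{for some } w > 0,$$

because a Clenshaw–Curtis rule with enough nodes evaluates the integral of each (polynomial) Hessian entry exactly, its weights are positive, its nodes include both endpoints, and the contributions of the interior nodes are positive semidefinite by convexity.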

 


Convergence

 

 

Lemma

Use an argument based on being “close enough” to a point with a positive definite Hessian

 

 


Convergence

Let’s try to generalize where we can

 

 


Convergence

Quadrature again!

 

Lemma

We can lower bound this using the lemma.

 

We just got everything we wanted!


Convergence

  •  


Convergence

  •  


Outline

  • An Introduction to Sums of Squares

  • Convergence of Higher Order Newton Methods

  • Numerical Examples


The univariate setting

  •  

 

 

 


Example 1

 

 

(strictly convex)

 


Example 2

(strongly convex)

 


Example 3

The Beale function (nonconvex)
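For reference, a quick definition and sanity check of this standard test function in Python (my own snippet; it does not reproduce the talk’s experiments):

import numpy as np

def beale(v):
    # Beale test function: nonconvex, with global minimum 0 at (3, 0.5)
    x, y = v
    return ((1.5 - x + x * y) ** 2
            + (2.25 - x + x * y ** 2) ** 2
            + (2.625 - x + x * y ** 3) ** 2)

print(beale(np.array([3.0, 0.5])))   # 0.0 at the known global minimizer
print(beale(np.array([0.0, 0.0])))   # 14.203125 at the origin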

 


An addendum: global convergence

  • Recall the convex regularized framework, which was globally convergent under certain assumptions:

  • Our algorithm chooses the coefficient on the regularizer based on the solution to an SDP
  • If we had an estimate of the relevant Lipschitz constant, we could take the larger of the SDP-based coefficient and the Lipschitz-based one
  • Caveat: slightly different handling is needed when d is even

 


Thank you!

Questions?