Problem Formulation
Find sufficient conditions for every pair of consecutive GD iterates to lie on a discretized trajectory from a reference persistently excited continuous-time (CT) family with a globally exponentially stable (GES) equilibrium at the unknown true parameters.
Future Work
PoE of SGD and its variants; PoE in Reinforcement Learning.
[Figure: classifier predictions "tank" (63.0%) and "airplane" (92.5%)]
Approach
Choose a reference family of PoE trajectories (a standard example from adaptive control is sketched after this list).
Prove sufficient conditions for GD to lie on PoE trajectories.
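As an illustration of such a reference family (our notation; not necessarily the exact family used in the paper), adaptive control provides the regressor-driven error system

\dot{x}(t) = -\,\phi(t)\,\phi(t)^{\top}\bigl(x(t) - \theta^{*}\bigr),

which has a globally exponentially stable equilibrium at the true parameters \theta^{*} when the bounded regressor \phi is persistently exciting.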
Key Idea
[Figure: image + noise = perturbed image]
Improving Neural Network Robustness with Persistency of Excitation
Kaustubh Sridhar, Oleg Sokolsky, Insup Lee, James Weimer
Deep learning is a parameter estimation problem
Persistency of excitation (PoE) is an integral condition used in parameter estimation that can increase robustness.
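For reference, the classical PoE condition from adaptive control (in our notation: regressor \phi, window length T, excitation level \mu) is

\int_{t}^{t+T} \phi(\tau)\,\phi(\tau)^{\top}\, d\tau \;\succeq\; \mu I \quad \text{for all } t \ge 0, \text{ for some } T, \mu > 0,

i.e., over every window of length T the regressor excites every direction of parameter space.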
Key Insight
Gradient descent (GD) dynamics can be modeled as a sampling of an adaptive continuous-time linear time-varying (LTV) system.
This allows us to prove PoE of GD for more than just the 2-layer networks considered in [Nar and Sastry 2019].
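As a sketch of this modeling step (our notation, not the paper's exact derivation): the GD iteration

\theta_{k+1} = \theta_k - \eta_k \nabla f(\theta_k)

can be viewed as an Euler sampling, with step sizes \eta_k, of the gradient flow \dot{\theta}(t) = -\nabla f(\theta(t)); writing the parameter error as \tilde{\theta}(t) = \theta(t) - \theta^{*} and expanding the gradient about \theta^{*} gives time-varying error dynamics of the form \dot{\tilde{\theta}}(t) = -A(t)\,\tilde{\theta}(t), an adaptive LTV system.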
Sufficient Conditions for PoE of GD
Assumption 1: L-smooth loss functions (common)
Assumption 2: Acuteness of descent directions (intuitive; can be monitored during training)
Theorem: We have PoE of GD when training a model via GD with a learning rate schedule to minimize an L-smooth loss function, provided the step-size bound on the schedule holds for all k and the matrix in the PoE condition is full rank.
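For completeness (our notation: loss f, regressors \phi_k): L-smoothness of the loss means

\| \nabla f(\theta) - \nabla f(\theta') \| \le L \|\theta - \theta'\| \quad \text{for all } \theta, \theta',

and a discrete-time analogue of the full-rank / PoE requirement over a window of N steps reads

\sum_{i=k}^{k+N-1} \phi_i \phi_i^{\top} \;\succeq\; \mu I \quad \text{for some } \mu > 0 \text{ and all } k.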
Scale the (given) baseline schedule to obtain a PoE-motivated schedule and an (empirically motivated) largest convergent schedule with suitable initial values.
Estimate a certified Lipschitz constant in the baseline with Extreme Value Theory (a minimal sketch follows this list).
Monitor Assumption 2 in the baseline; tune the batch size.
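A minimal sketch of the sampling step behind the Lipschitz-constant estimate, assuming a PyTorch model and loss; the names estimate_grad_lipschitz, loss_fn, radius, and n_pairs are hypothetical, and the paper's Extreme Value Theory fit would replace the plain maximum below with an extreme-value (e.g., reverse-Weibull) fit to block maxima of the sampled ratios.

import torch

def estimate_grad_lipschitz(model, loss_fn, data, target, n_pairs=200, radius=1e-3):
    # Sampling-based estimate of the loss gradient's Lipschitz constant:
    # draw pairs of nearby parameter vectors, compute
    # ||grad(theta1) - grad(theta2)|| / ||theta1 - theta2||,
    # and return the maximum observed ratio (EVT fit omitted in this sketch).
    params = [p for p in model.parameters() if p.requires_grad]
    base = [p.detach().clone() for p in params]

    def grad_at(offset_scale):
        # Perturb the parameters around their original values, evaluate the
        # gradient at the perturbed point, and return (gradient, parameters).
        # Assumes every trainable parameter receives a gradient.
        with torch.no_grad():
            for p, b in zip(params, base):
                p.copy_(b + torch.randn_like(p) * offset_scale)
        model.zero_grad()
        loss = loss_fn(model(data), target)
        loss.backward()
        g = torch.cat([p.grad.detach().flatten() for p in params])
        theta = torch.cat([p.detach().flatten() for p in params])
        return g, theta

    ratios = []
    for _ in range(n_pairs):
        g1, t1 = grad_at(radius)
        g2, t2 = grad_at(radius)
        denom = (t1 - t2).norm()
        if denom > 0:
            ratios.append(((g1 - g2).norm() / denom).item())

    # Restore the original parameters before returning.
    with torch.no_grad():
        for p, b in zip(params, base):
            p.copy_(b)
    return max(ratios)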
Our schedules outperform the state of the art in both standard and adversarial training.
Presented at the American Control Conference, 2022.
Robust Concept Learning and Lifelong Adaptation Against Adversarial Attacks: ARO MURI W911NF2010080