# Power Laws in Deep Learning 2: Universality

It is amazing that Deep Neural Networks display this Universality in their weight matrices, and this suggests some deeper reason for Why Deep Learning Works.

**By Charles Martin, Machine Learning Specialist**

Editor's note: You can read the previous post in this series, Power Laws in Deep Learning, here.

**Power Law Distributions in Deep Learning**

In a previous post, we saw that the Fully Connected (FC) layers of the most common pre-trained Deep Learning models display power law behavior. Specifically, for each FC weight matrix $\mathbf{W}$, we compute the eigenvalues $\lambda$ of the correlation matrix $\mathbf{X}=\frac{1}{N}\mathbf{W}^{T}\mathbf{W}$.

For every FC matrix, the eigenvalue frequencies, or Empirical Spectral Density (ESD) $\rho(\lambda)$, can be fit to a power law

$\rho(\lambda)\sim\lambda^{-\alpha}$

where the fitted exponents $\alpha$ all lie in a similar range.

*Remarkably, the FC matrices all lie within the Universality Class of Fat Tailed Random Matrices!*
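As a rough illustration of the procedure just described, the sketch below computes the eigenvalues of $\mathbf{X}$ and estimates a power law exponent from the tail. It assumes the standard continuous maximum-likelihood estimator for $\alpha$; the fitting method actually used for the post's figures may differ. The function names, the choice of `lambda_min` (the median eigenvalue), and the random stand-in matrix are all hypothetical.

```python
import numpy as np

def esd_eigenvalues(W):
    """Eigenvalues of X = W^T W / N for an N x M matrix W."""
    N = W.shape[0]
    X = W.T @ W / N
    return np.linalg.eigvalsh(X)  # X is symmetric, so eigvalsh applies

def fit_power_law_alpha(evals, lambda_min):
    """Continuous maximum-likelihood estimate of alpha for a tail
    rho(lambda) ~ lambda^(-alpha), using eigenvalues >= lambda_min."""
    tail = evals[evals >= lambda_min]
    return 1.0 + len(tail) / np.sum(np.log(tail / lambda_min))

# Stand-in for a layer's weight matrix (assumption: random, not a trained layer)
rng = np.random.default_rng(0)
W = rng.pareto(3.0, size=(1000, 500))

evals = esd_eigenvalues(W)
# lambda_min is a crude, hypothetical cutoff; a careful fit would select it
alpha = fit_power_law_alpha(evals, lambda_min=np.median(evals))
print(f"fitted alpha = {alpha:.2f}")
```

To apply this to a real network, `W` would be replaced by the weight matrix of an FC layer, oriented so that it has more rows than columns.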

**Heavy Tailed Random Matrices**

We define a random matrix by defining a matrix $\mathbf{W}$ of size $N\times M$, and drawing the matrix elements $W_{i,j}$ from a random distribution. We can choose a

- Gaussian Random Matrix: $W_{i,j}\sim N(0,\sigma^{2})$, where $N(0,\sigma^{2})$ is a Gaussian distribution

or a

- Heavy Tailed Random Matrix: $W_{i,j}\sim P(x)$, where $P(x)\sim x^{-(1+\mu)}$ is a power law distribution

In either case, Random Matrix Theory tells us what the asymptotic form of the ESD should look like. But first, let's see what model works best.

**AlexNet FC3**

First, let's look at the ESD $\rho(\lambda)$ for AlexNet layer FC3, both in full and zoomed in:

Recall that AlexNet FC3 fits a power law with exponent $\alpha$, so we also plot the ESD on a log-log scale:

AlexNet Layer FC3 Log Log Histogram of ESD

Notice that the distribution is linear in the central region, and the long tail cuts off sharply. This is typical of the ESDs for the fully connected (FC) layers of all the pretrained models we have looked at so far. We now ask...

*What kind of Random Matrix would make a good model for this ESD?*

**ESDs: Gaussian random matrices**

We first generate a few Gaussian Random matrices (mean 0, variance 1), for different aspect ratios Q, and plot the histogram of their eigenvalues.

    import numpy as np
    import matplotlib.pyplot as plt

    N, M = 1000, 500
    Q = N / M  # aspect ratio
    # W is N x M, so X = W^T W / N has shape M x M
    W = np.random.normal(0, 1, size=(N, M))
    X = (1 / N) * np.dot(W.T, W)
    evals = np.linalg.eigvalsh(X)  # X is symmetric, so its eigenvalues are real
    plt.hist(evals, bins=100, density=True)

Empirical Spectral Density (ESD) for Gaussian Random Matrices, with different Q values.

Notice that the shape of the ESD depends only on Q, and is tightly bounded; there is, in fact, effectively no tail at all to the distributions (except, perhaps, misleadingly for Q=1).
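The tight bounds can be checked against the Marchenko-Pastur (MP) bulk edges, a standard RMT result that is not derived in this post. The sketch below assumes $\mathbf{W}$ is $N\times M$ with $Q=N/M\geq 1$ and i.i.d. entries of variance `sigma2`, and compares the predicted edges with the smallest and largest empirical eigenvalues.

```python
import numpy as np

N, M = 1000, 500          # W is N x M, so Q = N / M >= 1
Q = N / M
sigma2 = 1.0              # variance of the matrix entries

rng = np.random.default_rng(0)
W = rng.normal(0.0, np.sqrt(sigma2), size=(N, M))
evals = np.linalg.eigvalsh(W.T @ W / N)

# Marchenko-Pastur bulk edges for X = W^T W / N
lam_minus = sigma2 * (1.0 - 1.0 / np.sqrt(Q)) ** 2
lam_plus = sigma2 * (1.0 + 1.0 / np.sqrt(Q)) ** 2

print(f"MP edges:  [{lam_minus:.3f}, {lam_plus:.3f}]")
print(f"empirical: [{evals.min():.3f}, {evals.max():.3f}]")
```

For a finite matrix the empirical extremes only approximately match the edges, and the agreement should tighten as N and M grow at fixed Q.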

**ESDs: Power Laws and Log Log Histograms**

We can generate a heavy, or fat-tailed, random matrix just as easily using the numpy Pareto function

    W = np.random.pareto(mu, size=(N, M))

Heavy Tailed Random matrices have very different ESDs. They have very long tails; so long, in fact, that it is better to plot them on a log log Histogram.
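One way to build such a log log histogram is to bin the eigenvalues on logarithmically spaced edges, so that each decade of the tail gets a comparable number of bins. This is a minimal sketch: the helper name `loglog_histogram`, the bin count, and the exponent `mu = 3.0` are illustrative assumptions. Note that NumPy's `pareto` draws from the Lomax (Pareto II) distribution, whose density falls off as $(1+x)^{-(1+\mu)}$.

```python
import numpy as np

def loglog_histogram(evals, n_bins=50):
    """Density histogram on logarithmically spaced bins.
    Returns (bin_centers, density); non-positive eigenvalues are dropped."""
    evals = np.asarray(evals)
    evals = evals[evals > 0]
    edges = np.logspace(np.log10(evals.min()), np.log10(evals.max()), n_bins + 1)
    edges[0], edges[-1] = evals.min(), evals.max()  # guard against rounding at the ends
    density, edges = np.histogram(evals, bins=edges, density=True)
    centers = np.sqrt(edges[:-1] * edges[1:])       # geometric bin centers
    return centers, density

mu = 3.0                      # illustrative exponent, not a value from the post
N, M = 1000, 500
rng = np.random.default_rng(0)
W = rng.pareto(mu, size=(N, M))
evals = np.linalg.eigvalsh(W.T @ W / N)
centers, density = loglog_histogram(evals)

# To draw it (assuming matplotlib is available):
# import matplotlib.pyplot as plt
# plt.loglog(centers, density, "o"); plt.xlabel("eigenvalue"); plt.ylabel("density")
```

A straight-line region in the resulting plot is what suggests a power law; empty bins simply show up as zero density and are skipped on a log axis.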

Do any of these look like a plausible model for the ESDs of the weight matrices of a big DNN, like AlexNet?

- the smallest exponent $\mu$ (blue) has a very long tail, extending over 11 orders of magnitude. This means the largest eigenvalues would be enormous. No real **W** would behave like this.
- the largest exponent $\mu$ (red) has a very compact ESD, more closely resembling the Gaussian **W**s above.
- the fat-tailed, intermediate $\mu$ ESD (green), however, is just about right. The ESD is linear in the central region, suggesting a power law. It is a little too large for our eigenvalues, but the tail also cuts off sharply, which is expected for any finite **W**. So we are close.
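The comparison above can be reproduced in rough form by measuring how many orders of magnitude each ESD spans. In this sketch the exponents 1, 3 and 5 are illustrative choices, one per regime, and are not taken from the post's figure; the exact spans also depend on N, M and the random seed.

```python
import numpy as np

N, M = 1000, 500
rng = np.random.default_rng(0)

spans = {}
for mu in (1.0, 3.0, 5.0):    # assumed values, one per regime
    W = rng.pareto(mu, size=(N, M))
    evals = np.linalg.eigvalsh(W.T @ W / N)
    evals = evals[evals > 0]  # drop any numerically non-positive values
    spans[mu] = np.log10(evals.max() / evals.min())
    print(f"mu = {mu}: ESD spans about {spans[mu]:.1f} orders of magnitude")
```

Because `np.random.pareto` produces only non-negative entries, each matrix has a nonzero mean, which contributes one large outlying eigenvalue; this matters most for the larger exponents, where the tail itself is short.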

**AlexNet FC3**

Let's overlay the ESD of the fat-tailed **W** with the actual empirical $\rho(\lambda)$ from AlexNet for layer FC3.

We see a pretty good match to a Fat-tailed random matrix with the intermediate (green) value of $\mu$.

It turns out there is something very special about $\mu$ being in the range 2-4.

**Universality Classes:**

Random Matrix Theory predicts the shape of the ESD, in the asymptotic limit, for several kinds of Random Matrix, called *Universality Classes.* The 3 different values of $\mu$ each represent a different Universality Class.

In particular, if we draw $\mathbf{W}$ from any heavy tailed / power law distribution, the empirical (i.e. finite size) eigenvalue density $\rho(\lambda)$ is likewise a power law (PL), either globally, or at least locally.

What is more, the predicted ESDs have different, characteristic global and local shapes, for specific ranges of $\mu$. And the amazing thing is that

*the ESDs of the fully connected (FC) layers of pretrained DNNs all resemble the ESDs of the Fat-Tailed Universality Classes of Random Matrix Theory*

But this is a little tricky to show, because we need to relate the exponent $\alpha$ we fit to the theoretical $\mu$. We now look at how the two are connected.

**Relations between $\alpha$ and $\mu$**

RMT tells us that, for heavy-tailed $\mathbf{W}$ in the infinite size limit, the ESD takes the limiting form

$\rho(\lambda)\sim\lambda^{-\alpha}$, where $\alpha=\frac{\mu}{2}+1$

And this works pretty well in practice for the Heavy Tailed Universality Class, for $\mu<2$. But for any finite matrix, as soon as $\mu>2$, the finite size effects kick in, and we cannot naively apply the infinite limit result.
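A heuristic way to see where a relation of the form $\alpha=\frac{\mu}{2}+1$ comes from is a change of variables. The sketch below assumes the entries have tail $P(x)\sim x^{-(1+\mu)}$ and that the largest eigenvalues are dominated by the largest squared entries, $\lambda\sim x^{2}$. It is a rough argument, not a proof, and it ignores the finite size effects discussed here.

```latex
\begin{aligned}
P(x) &\sim x^{-(1+\mu)}, \qquad \lambda \sim x^{2} \\
\rho(\lambda) &= P(x)\left|\frac{dx}{d\lambda}\right|
  \sim \lambda^{-(1+\mu)/2}\,\lambda^{-1/2}
  = \lambda^{-(1+\mu/2)} \\
\Rightarrow\quad \alpha &= 1 + \frac{\mu}{2}
\end{aligned}
```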

**Statistics of the maximum eigenvalue(s)**

RMT not only tells us about the shape of the ESD; it also makes statements about the statistics of the edge and/or tails, that is, the fluctuations in the maximum eigenvalue $\lambda_{max}$. Specifically, we have

- Gaussian RMT: $\lambda_{max}$ follows Tracy-Widom statistics
- Fat Tailed RMT: $\lambda_{max}$ follows Frechet statistics

For standard, Gaussian RMT, $\lambda_{max}$ (near the bulk edge) is governed by the famous Tracy-Widom distribution. And for $\mu>4$, RMT is governed by the Tao Four Moment Theorem.

But for $\mu<4$, the tail fluctuations follow Frechet statistics, and the maximum eigenvalue has Power Law finite size effects.

In particular, the effects of M and Q kick in as soon as $\mu>2$. If we underestimate $\lambda_{max}$ (small Q, large M), the power law will look weaker, and we will overestimate $\alpha$ in our fits.

And, for us, this affects how we estimate $\mu$ from $\alpha$ and assign the Universality Class.
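The difference between the two kinds of edge statistics shows up clearly in a small simulation. The sketch below draws many random matrices, records the largest eigenvalue of each, and compares the relative spread for Gaussian and heavy-tailed entries. The matrix sizes, the trial count, the exponent `mu = 1.5`, and the sign-symmetrized Pareto entries are all assumptions made for this illustration, not settings from the post.

```python
import numpy as np

def lambda_max_samples(draw, N=200, M=100, trials=40):
    """Largest eigenvalue of X = W^T W / N over repeated draws of W."""
    out = []
    for _ in range(trials):
        W = draw((N, M))
        out.append(np.linalg.eigvalsh(W.T @ W / N)[-1])  # eigvalsh sorts ascending
    return np.array(out)

def relative_spread(x):
    """Interquartile range divided by the median (robust to heavy tails)."""
    q25, q50, q75 = np.percentile(x, [25, 50, 75])
    return (q75 - q25) / q50

rng = np.random.default_rng(0)
mu = 1.5  # illustrative heavy-tailed exponent (assumption)

gauss = lambda_max_samples(lambda shape: rng.normal(0.0, 1.0, size=shape))
# Symmetrized Pareto entries, so the top eigenvalue is not set by a nonzero mean
heavy = lambda_max_samples(
    lambda shape: rng.pareto(mu, size=shape) * rng.choice([-1.0, 1.0], size=shape)
)

print(f"Gaussian:     relative spread of lambda_max = {relative_spread(gauss):.3f}")
print(f"Heavy-tailed: relative spread of lambda_max = {relative_spread(heavy):.3f}")
```

The interquartile range is used instead of the standard deviation because, for very heavy tails, the sample mean and variance of $\lambda_{max}$ are dominated by a few extreme draws.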

**Fat Tailed Matrices and their Finite Size Effects**

Here, we generate ESDs for 3 different Pareto Heavy tailed random matrices, with fixed M (left) or N (right), but different Q. We fit each ESD to a Power Law. We then plot $\alpha$, as fit, against $\mu$.

The red lines are predicted by Heavy Tailed RMT (MP) theory, which works well for Heavy Tailed ESDs with $\mu<2$. For Fat Tails, with $2<\mu<4$, the finite size effects are difficult to interpret. The main take-away is...

*We can identify finite size matrices W that behave like the Fat Tailed Universality Class of RMT ($2<\mu<4$) with Power Law fits, even with exponents $\alpha$ ranging up to 4 (and even up to 5-6).*
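An experiment of this kind can be sketched as follows: fix M, vary Q and $\mu$, fit $\alpha$ to each ESD, and compare with the infinite-size relation $\alpha=1+\mu/2$. This is only an approximation of the setup described above. The simple maximum-likelihood fit, the median cutoff `lambda_min`, and the particular values of M, Q and $\mu$ are assumptions, so the numbers will not match the post's figures.

```python
import numpy as np

def fit_alpha(evals, lambda_min):
    """Continuous maximum-likelihood power law exponent for evals > lambda_min."""
    tail = evals[evals > lambda_min]
    return 1.0 + len(tail) / np.sum(np.log(tail / lambda_min))

M = 300                       # fixed number of columns (assumed)
rng = np.random.default_rng(0)

results = {}
for Q in (1, 2, 4):           # assumed aspect ratios
    N = Q * M
    for mu in (1.0, 2.0, 3.0):    # assumed Pareto exponents
        W = rng.pareto(mu, size=(N, M))
        evals = np.linalg.eigvalsh(W.T @ W / N)
        evals = evals[evals > 0]  # drop any numerically non-positive values
        alpha = fit_alpha(evals, lambda_min=np.median(evals))
        results[(Q, mu)] = alpha
        print(f"Q={Q} mu={mu}: fitted alpha={alpha:.2f}, "
              f"infinite-size prediction={1 + mu / 2:.2f}")
```

The fitted values are sensitive to the choice of `lambda_min`, so a more careful analysis would select the cutoff by goodness of fit rather than fixing it at the median.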

**Implications**

It is amazing that Deep Neural Networks display this Universality in their weight matrices, and this suggests some deeper reason for Why Deep Learning Works.

**Self Organized Criticality**

In statistical physics, if a system displays Power Laws, this can be evidence that it is operating near a critical point. It is known that real, spiking neurons display this behavior, called Self Organized Criticality.

It appears that Deep Neural Networks may be operating under similar principles, and in future work, we will examine this relation in more detail.

The code for this post is in this github repo on ImplicitSelfRegularization.

For more information, see this recorded talk on this topic: **Why Deep Learning Works: Implicit Self-Regularization in Deep Neural Networks**

**Bio: Dr. Charles Martin** is a specialist in Machine Learning, Data Science, Deep Learning, and Artificial Intelligence. He helped develop Aardvark, a Machine Learning / NLP startup acquired by Google in 2010. He currently runs a boutique consulting firm specializing in software development, machine learning and AI. His clients include Wall Street firms, Big Pharma, Telecom, eCommerce, early and late stage startups, and the largest Internet companies such as eHow, eBay, GoDaddy, etc.

Original. Reposted with permission.
