Is depth useful for self-attention?
Learn about recent research that is the first to explain a surprising phenomenon: in BERT/Transformer-like architectures, deepening the network does not seem to be better than widening it (that is, increasing the representation dimension). This empirical observation contrasts with a fundamental premise in deep learning.
By Yoav Levine, Noam Wies, Or Sharir, Hofit Bata and Amnon Shashua, Hebrew University, Jerusalem.
In a nutshell: In our new paper, we prove that a double-exponential depth-efficiency takes place in self-attention networks, while at the same time we pinpoint a transition, at depth L=log_{3}(d_{x}), at which the capacity of the self-attention width d_{x} (the representation dimension) to support this efficiency is exhausted. Our predictions strongly accord with extensive empirical ablations in Kaplan et al., accounting for the different behaviors in the two regimes of depth-efficiency and depth-inefficiency. Pointing at the network’s width as a limiting factor, we predict that solutions for dramatically increasing the width (model parallelism, etc.) can facilitate the next leap in self-attention expressivity.
Background: Depth is less crucial in self-attention
The golden age of deep learning has popularized the notion of depth-efficiency: from an expressiveness standpoint, increasing a neural network's size by adding more layers (deepening) is advantageous relative to other ways of adding parameters, such as increasing the dimension of the internal representation (widening). Beyond overwhelming empirical signals for this notion, depth-efficiency was theoretically supported from a variety of angles. Diminishing returns in very deep networks were mainly attributed to optimization issues, and indeed alleviating these issues allowed network depths to grow from tens to hundreds of layers and beyond, enabling deep convolutional networks (ConvNets) to advance the state of the art in computer vision applications.
Since the introduction of the Transformer, along with its encoder-only variant, BERT, self-attention based deep learning architectures have taken over the field of natural language processing. However, in contrast to the depth "arms race" that took place in the ConvNet case, the leading self-attention networks are not much deeper than the original depth-12 BERT-base model. In fact, the strongest self-attention model trained to date, T5, has increased the parameter count of BERT-base by a factor of 100, while only increasing its depth by a factor of 4. The remaining size increase stems from an increase in layer widths, clearly countering the depth-efficiency notion.
A recent extensive empirical ablation study by Kaplan et al. provides systematic support for the above signal. Figure 1 above, taken from this study, shows that the overall (non-embedding) network size, given by 12⋅L⋅d^{2}_{x}, where L is the number of self-attention layers (network depth) and d_{x} is the hidden representation dimension (network width), is the main predictor of performance, regardless of the depth-to-width ratio. Experiments along the L>6 (yellow) curve include self-attention networks of depths from L=12 to L=200, all approximately obeying the same improvement trend, which depends only on network size. This suggests that depth does not play as crucial a role in self-attention networks as it does in convolutional networks.
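To make the size formula concrete, here is a minimal Python sketch. The function name and the two configurations are illustrative assumptions, not taken from Kaplan et al. It shows that quadrupling the depth while halving the width leaves 12⋅L⋅d^{2}_{x} unchanged, so networks of very different shapes can share the same non-embedding size.

```python
def non_embedding_params(depth: int, width: int) -> int:
    """Approximate non-embedding parameter count: 12 * L * d_x^2."""
    return 12 * depth * width ** 2


# Two hypothetical configurations with the same budget but different shapes.
shallow_wide = non_embedding_params(depth=12, width=768)
deep_narrow = non_embedding_params(depth=48, width=384)

print(shallow_wide, deep_narrow)
```

Both calls return the same count, since 48⋅384^{2} = 12⋅768^{2}.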
Our finding: Network width caps the benefits of depth in self-attention
In our new work, we theoretically address the depth-to-width trade-off in self-attention networks and reveal fundamental subtleties in the above picture. Rather than reinforcing the seemingly plausible hypothesis for the trend in the above figure, by which widening a self-attention network is as effective as deepening it, we establish the contrary. We show that the operation of stacking self-attention layers is so effective that it quickly saturates the capacity of the network's width.
Specifically, we establish the existence of a depth threshold that depends logarithmically on the width d_{x}, denoted L_{th}(d_{x})=log_{3}(d_{x}). Below the threshold, we prove that double-exponential depth-efficiency takes place in self-attention networks:
Informal theorem: A self-attention network of depth under log_{3}(d_{x}) can only be replicated by a shallower network if the latter is wider by a factor that is double-exponential in the depth ratio.
In the other regime, above the threshold, we establish a completely different behavior for the operation of stacking self-attention layers:
Informal theorem: For self-attention networks of depth over log_{3}(d_{x}), width and depth contribute similarly to network expressivity.
A closer look at the experimental ablation in Kaplan et al., displayed in Figure 2 below, reveals an agreement with our theoretical indications. The figure shows that depth is advantageous for L≤6, but that this advantage completely disappears for L>6. When plugging in the actual width values, which range around d_{x}=1000, our theoretical threshold for depth-efficiency agrees with the empirical findings, as L_{th}(d_{x})=log_{3}(1000)≃6.3.
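The threshold is easy to evaluate numerically. The following is a small sketch in Python; the function names and the regime labels are assumptions made for illustration, and the case of depth exactly equal to the threshold is not covered by the informal theorems, so it is arbitrarily assigned to the second regime here.

```python
import math


def depth_threshold(width: float) -> float:
    """Depth threshold L_th(d_x) = log_3(d_x)."""
    return math.log(width, 3)


def regime(depth: int, width: float) -> str:
    """Label which regime a (depth, width) pair falls in.

    The labels are illustrative; depth == threshold is assigned to the
    second regime as an arbitrary convention.
    """
    if depth < depth_threshold(width):
        return "depth-efficient"
    return "width-limited"


print(round(depth_threshold(1000), 1))  # about 6.3, as quoted in the text
print(regime(6, 1000), regime(12, 1000))
```

With d_{x}=1000, a depth of 6 falls just under the threshold, while a depth of 12 falls well above it.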
Practical derivatives
The clear boundaries drawn between the two regimes suggest spending any parameter budget of 12⋅L⋅d^{2}_{x} so that the depth does not fall below the threshold of log_{3}(d_{x}). Below this threshold, we have shown a clear disadvantage in the expressiveness of shallower networks. The table below contains the minimal depths per parameter budget by these considerations, which accord with the empirical evidence in Figure 2. Such insights may prove useful, given the rapid increase in model sizes.
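One way to read this rule is sketched below in Python. It is a simplified illustration and not the procedure used to produce the table: it assumes the budget is exactly 12⋅L⋅d^{2}_{x}, lets the width take non-integer values, and searches for the smallest depth L satisfying L ≥ log_{3}(d_{x}). The function names are hypothetical, and the resulting values need not match the table.

```python
import math


def width_for_budget(budget: float, depth: int) -> float:
    """Width d_x implied by budget = 12 * L * d_x^2 at a given depth L."""
    return math.sqrt(budget / (12 * depth))


def min_depth_for_budget(budget: float) -> int:
    """Smallest depth L with L >= log_3(d_x) under a fixed budget.

    As L grows, the implied width (and hence log_3 of it) shrinks, so the
    first depth that satisfies the condition is the minimal one.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    depth = 1
    while depth < math.log(width_for_budget(budget, depth), 3):
        depth += 1
    return depth


for budget in (10**7, 10**8, 10**9, 10**10):
    depth = min_depth_for_budget(budget)
    print(budget, depth, round(width_for_budget(budget, depth)))
```

Because the threshold grows only logarithmically with width, the minimal depth under this reading increases slowly even as the budget grows by orders of magnitude.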
Moreover, the observation that width is the limiting factor for depth-efficiency motivates the development of methods for dramatically increasing it in self-attention architectures. The successful ALBERT, a BERT variant that shares parameters between self-attention layers, allows wider models to be trained for the same budget. For a more significant increase that also addresses computational efficiency, we point to the concept employed in ShuffleNet, which has proved to be very efficient for convolutional networks. Its authors suggest increasing the representation dimension while using only a fraction of it for computation in each layer. This way, the computation costs are contained, but the theoretical limitations posed by our work are relaxed. More generally, width increases have greater potential for speeding up network inference and training, because computation across the width can be parallelized, whereas depth yields a sequential computation. Our theoretical indication that the contributions of depth and width are on the same order, and moreover that width limits the ability to enjoy depth-efficiency, may motivate the development of further model parallelism methods for Transformers.
Check out all of the details in Limits to Depth Efficiencies of Self-Attention.
Original. Reposted with permission.

