Large Scale Adversarial Representation Learning




By Most Husne Jahan, Robert Hensley, Gurinder Ghotra

This post is part of the "superblog" that is the collective work of the participants of the GAN workshop organized by Aggregate Intellect. This post serves as a proof of work, and covers some of the concepts covered in the workshop in addition to advanced concepts pursued by the participants.

 


 

1. Comparison of BiGAN, BigGAN, and BigBiGAN

 

BiGAN: Bidirectional Generative Adversarial Networks

 

Figure 1: The structure of Bidirectional Generative Adversarial Networks (BiGAN).

GANs can be used for unsupervised learning where a generator maps latent samples to generate data, but this framework does not include an inverse mapping from data to latent representation.

BiGAN adds an encoder E to the standard generator-discriminator GAN architecture — the encoder takes input data x and outputs a latent representation z of the input. The BiGAN discriminator D discriminates not only in data space (x versus G(z)), but jointly in data and latent space (tuples (x, E(x)) versus (G(z), z)), where the latent component is either an encoder output E(x) or a generator input z.
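For reference, the BiGAN training objective from the paper is a minimax game over these joint pairs, which trains the encoder and generator to fool a discriminator that sees both directions of the mapping:

$$\min_{G,E}\,\max_{D}\;\mathbb{E}_{x \sim p_X}\big[\log D(x, E(x))\big] + \mathbb{E}_{z \sim p_Z}\big[\log\big(1 - D(G(z), z)\big)\big]$$

At the global optimum of this game, the paper shows that E and G invert one another, so E(x) learns a meaningful latent representation without supervision.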

 

BigGAN: Large Scale GAN Training for High Fidelity Natural Image Synthesis

 
BigGAN scales up traditional GAN models, resulting in GANs with more parameters (e.g. more feature maps), larger batch sizes, and several architectural changes. The BigGAN architecture also introduces a “truncation trick” used during image generation which improves image quality, supported by a specific regularization technique (Orthogonal Regularization of the generator weights). The truncation trick means using a different distribution of samples for the generator’s latent space during inference than during training: the latent samples follow a standard Gaussian during training, but during inference a truncated Gaussian is used, where values beyond a given threshold are resampled. The resulting approach is capable of generating larger and higher-quality images than traditional GANs, such as 256×256 and 512×512 images.
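As a concrete illustration, here is a minimal NumPy sketch of truncated sampling; the function name and the default threshold of 2.0 are illustrative choices, not taken from the paper’s code:

```python
import numpy as np

def truncated_z_sample(batch_size, z_dim, threshold=2.0, seed=0):
    """Sample z ~ N(0, I), resampling any entry whose magnitude exceeds
    `threshold`, which yields the truncated Gaussian used at inference."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((batch_size, z_dim))
    out_of_range = np.abs(z) > threshold
    while out_of_range.any():
        z[out_of_range] = rng.standard_normal(out_of_range.sum())
        out_of_range = np.abs(z) > threshold
    return z
```

Lowering the threshold trades sample variety for individual sample fidelity. The authors proposed a model (BigGAN) whose modifications build on the Self-Attention GAN (SAGAN) architecture, summarized in the figure below.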



Figure 2: Summary of the Self-Attention Module Used in the Self-Attention GAN


 

Figure 3: Sample images generated by BigGANs

 

BigBiGAN – bi-directional BigGAN: Large Scale Adversarial Representation Learning

 
(Unsupervised Representation Learning)

Researchers introduced BigBiGAN, which builds upon the state-of-the-art BigGAN model, extending it to representation learning by adding an encoder and modifying the discriminator. BigBiGAN is a combination of BigGAN and BiGAN that explores the potential of GANs for a range of applications, such as unsupervised representation learning and unconditional image generation.

It has been shown that BigBiGAN (a BiGAN with a BigGAN generator) matches the state of the art in unsupervised representation learning on ImageNet. The authors proposed a more stable version of the joint discriminator for BigBiGAN compared to the discriminator used previously, and showed that the representation learning objective also helps unconditional image generation.



Figure 4: An annotated illustration of the architecture of BigBiGAN. The red section is derived from BiGAN, whereas the blue sections are based on the BigGAN structure with the modified discriminators.

The above figure shows the structure of the BigBiGAN framework, where a joint discriminator D is used to compute the loss. Its inputs are data-latent pairs: either (x ∼ Px, ẑ ∼ E(x)), sampled from the data distribution Px and the encoder E outputs, or (x̂ ∼ G(z), z ∼ Pz), sampled from the generator G outputs and the latent distribution Pz. The loss includes the unary data term Sx and the unary latent term Sz, as well as the joint term Sxz, which ties the data and latent distributions together.
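Assuming the hinge loss that BigBiGAN inherits from BigGAN, the per-pair discriminator loss can be sketched as summing the three terms, with y = +1 for encoder pairs (x, E(x)) and y = −1 for generator pairs (G(z), z):

$$\ell_D(x, z, y) = h\big(y\,S_x(x)\big) + h\big(y\,S_z(z)\big) + h\big(y\,S_{xz}(x, z)\big), \qquad h(t) = \max(0,\,1 - t)$$

The encoder and generator are trained adversarially against the same summed scores.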


Figure 5: Selected reconstructions from an unsupervised BigBiGAN model

In summary, BigBiGAN represents progress in image generation quality that translates to substantially improved representation learning performance.

References:

  1. BiGAN Paper: https://arxiv.org/pdf/1605.09782.pdf
  2. BigBiGAN Paper: https://arxiv.org/pdf/1907.02544.pdf
  3. BigGAN Paper: https://arxiv.org/pdf/1809.11096.pdf

 

2. Ablation study conducted for BigBiGAN:

 
As an ablation study, different elements of the BigBiGAN architecture were removed in order to better understand the effects of the respective elements. The metrics used for the study were the Inception Score (IS) and the Fréchet Inception Distance (FID). IS measures convergence to the major modes of the data distribution, while FID measures how well the entire distribution is represented; a higher IS is considered better, whereas a lower FID is considered better.
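For reference, the standard definitions of both metrics, where p(y | x) is an Inception classifier's label distribution and (μr, Σr), (μg, Σg) are the Inception-feature means and covariances of real and generated images:

$$\mathrm{IS} = \exp\!\Big(\mathbb{E}_{x \sim p_g}\,\mathrm{KL}\big(p(y \mid x)\,\|\,p(y)\big)\Big), \qquad \mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\Big(\Sigma_r + \Sigma_g - 2\big(\Sigma_r \Sigma_g\big)^{1/2}\Big)$$

The following points highlight the findings of the ablation study: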

  1. Latent distribution Pz and stochastic E: the study upholds the BigGAN finding that random sampling from the latent space z is the superior method.
  2. Unary loss terms:
     - Removing both terms is equivalent to using BiGAN.
     - Removing Sx leads to inferior classification results, as Sx represents the standard generator loss in the base GAN.
     - Removing Sz does not have much impact on classification accuracy.
     - Keeping only Sz has a negative impact on classification accuracy.

The divergence between the IS and FID scores led to the postulation that BigBiGAN may be forcing the generator to produce distinguishable outputs across the entire latent space, rather than collapsing large volumes of latent space into a single mode of the data distribution.

It would have been interesting to see how much improvement the unary terms provide when the generator is reduced from BigGAN to DCGAN; this change of generator would have shown their advantage conclusively. The table of IS and FID scores (with the relevant scores highlighted) follows:


Table 1: Results for variants of BigBiGAN, given as Inception Score (IS) and Fréchet Inception Distance (FID) of the generated images, and ImageNet top-1 classification accuracy percentage.

 

3. Generator Capacity

 
The researchers found that generator capacity was critical to the results: reducing the generator’s capacity reduced classification accuracy. Upgrading the generator from DCGAN to BigGAN was a key contributor to BigBiGAN’s success.

 

4. Comparison to Standard BigGAN

 
BigBiGAN without the encoder and with only the Sx unary term was found to produce a worse IS and the same FID when compared to standard BigGAN. From this, the researchers postulated that the addition of the encoder and the new joint discriminator does not decrease generated image quality, as seen from the FID score. The lower IS is attributed to reasons similar to those given for the Sz unary term (see the unary loss term findings in the ablation study above).

 

5. Higher resolution input for Encoder with varying resolution output from Generator

 
BigBiGAN uses:

  1. A higher input resolution for the encoder.
  2. A lower resolution for the generator and discriminator.

The authors experimented with varying resolutions for the encoder and the generator, and concluded that increasing the resolution of the generator, with a fixed high resolution for the encoder, improves performance.

Note: looking at the table (the relevant portion is highlighted), this seems to be the case only for IS and not for FID, which increases from 15.82 to 38.58 when we go from a low generator resolution to a high one.

 

6. Decoupled Encoder / Generator optimizer:

 
Changing the learning rate for the encoder dramatically improved training and the final representation: using a 10× higher learning rate for the encoder, while keeping the generator learning rate fixed, led to better results.
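A minimal PyTorch sketch of this decoupling; the module definitions and the base learning rate are placeholders rather than values from the paper:

```python
import torch
import torch.nn as nn

# Placeholder stand-ins for the BigBiGAN generator and encoder.
generator = nn.Sequential(nn.Linear(120, 256), nn.ReLU(), nn.Linear(256, 784))
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 120))

base_lr = 1e-4  # illustrative base learning rate

# Decoupled optimizers: the encoder runs at a 10x higher learning rate
# while the generator's learning rate stays fixed.
opt_G = torch.optim.Adam(generator.parameters(), lr=base_lr, betas=(0.0, 0.999))
opt_E = torch.optim.Adam(encoder.parameters(), lr=10 * base_lr, betas=(0.0, 0.999))
```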

 

7. BigBiGAN basic structures compared to the Standard GAN

 
At the heart of the standard GAN are a generator and a discriminator. BigBiGAN expands on this, building on the work of BiGAN and BigGAN, to include an encoder and “unary term” discriminators (F and H), whose outputs are then jointly discriminated, along “encoder vs. generator” lines, by the final discriminator (J). As a result of these additions, some natural model changes emerge.

 

Change in the discrimination paradigm

 
Where the standard GAN discriminates between ‘real’ and ‘fake’ inputs, BigBiGAN shifts that paradigm slightly to discriminating between ‘encoder’ and ‘generator’ pairs. If you think about the model in terms of “real” and “fake”, you might be tempted to treat the sampled latent z as “real” and the encoded latent E(x) as “fake”; that is not what the model does, which is why the shift toward encoder vs. generator matters. From this point on, each discriminator should be read as discriminating “encoder from generator”, no longer “real from fake.”

 

Other natural model changes that emerge from the addition of an encoder and unary terms

 
Since the generator attempts to generate images, and the encoder attempts to generate latent vectors (the “noise” of the standard GAN), their outputs have different shapes. The image shapes are handled much as in a DCGAN, while the latent shapes are handled with linear layers, as in the original GAN. As a result, the F discriminator is a CNN that discriminates between encoder and generator images, while the H discriminator is a linear module that accepts a flattened input and discriminates between encoder and generator latents.

After this first phase of discrimination, the outputs of F are flattened so they can be concatenated with the outputs of H, and the concatenated F and H outputs are jointly fed into the final discriminator J. J then discriminates between the concatenated encoder values [Fout_e, Hout_e] and the concatenated generator values [Fout_g, Hout_g], which can also be written as [F(x), H(E(x))] vs. [F(G(z)), H(z)].

For scoring F, H, and J, with F_out and H_out needing to be matrices that can be fed into J, each output must be reduced to a scalar after its respective discrimination stage. Out of this requirement emerge the terms Sx, Sxz, and Sz: linear layers that simply reduce F_out, J_out, and H_out to the scalars Sx(F_out), Sxz(J_out), and Sz(H_out), which are then summed (Sx + Sxz + Sz) to produce the final score, as sketched below.
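A hypothetical PyTorch sketch of this F / H / J wiring with the Sx, Sz, and Sxz scalar heads; the layer sizes, depths, and the 3×64×64 input are illustrative assumptions, not the paper’s architecture:

```python
import torch
import torch.nn as nn

class JointDiscriminator(nn.Module):
    """Sketch of BigBiGAN-style discriminator wiring: F (CNN over images),
    H (linear module over latents), J (joint module over both), plus the
    linear heads Sx, Sz, Sxz that reduce each output to a scalar."""

    def __init__(self, z_dim=120, feat_dim=256):
        super().__init__()
        # F: convolutional discriminator over 3x64x64 images (assumed size).
        self.F = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Flatten(),
            nn.Linear(128 * 16 * 16, feat_dim),
        )
        # H: linear module over flattened latent vectors.
        self.H = nn.Sequential(nn.Linear(z_dim, feat_dim), nn.LeakyReLU(0.2))
        # J: joint module over the concatenated F and H outputs.
        self.J = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.LeakyReLU(0.2))
        # Linear heads reducing F_out, H_out, and J_out to scalars.
        self.Sx = nn.Linear(feat_dim, 1)
        self.Sz = nn.Linear(feat_dim, 1)
        self.Sxz = nn.Linear(feat_dim, 1)

    def forward(self, x, z):
        f = self.F(x)                         # image features (F_out)
        h = self.H(z)                         # latent features (H_out)
        j = self.J(torch.cat([f, h], dim=1))  # joint features (J_out)
        # Final score: the sum of the unary and joint scalar terms.
        return self.Sx(f) + self.Sz(h) + self.Sxz(j)
```

In use, the same module would score encoder pairs as D(x, E(x)) and generator pairs as D(G(z), z), with those two scores feeding the adversarial loss.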

Compared to the standard GAN, which discriminates real values from fake values (x from G(z)), BigBiGAN can be seen as similarly discriminating a group of encoder values from a group of generator values: (Sx_e + Sxz_e + Sz_e) from (Sx_g + Sxz_g + Sz_g).

 
Original. Reposted with permission.
