1 Why generative modeling works

1.1 Vocabulary and notation

D Discriminator

G Generator

1.2 Generative models

A generative model takes training examples and uses them to learn the parameters of a distribution.

The examples x come from a data distribution pdata, but pdata itself is unknown. We try to approximate it with a model distribution pmodel, using only the examples we can draw from pdata.

Why should we care about generative models?

  • They test our ability to represent and manipulate high-dimensional probability distributions
  • They can be incorporated into reinforcement learning: for time series, a generative model can simulate possible futures and provide extra training experience; the same idea applies to inverse RL
  • They can be trained with missing data (and used for semi-supervised learning)
  • They let machine learning handle multi-modal outputs, where several different answers are all correct
  • They enable realistic sample generation

1.3 Adversarial Networks

Two networks play a game against each other:


Discriminator:

  • x is sampled from the data
  • D is a differentiable function (a network)
  • D tries to make D(x) close to 1 (classify real data as real)


Generator:

  • takes input noise z
  • G is a differentiable function (a network)
  • x = G(z) is a sample from the model
  • the fake sample is passed to the discriminator: D tries to make D(G(z)) close to 0, while G tries to make D(G(z)) close to 1

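As a concrete illustration, here is a minimal sketch of the two players as differentiable functions in PyTorch; the layer sizes, latent dimension and data dimension are arbitrary assumptions, not part of the notes.

import torch
import torch.nn as nn

latent_dim, data_dim = 64, 784  # assumed sizes (e.g. flattened 28x28 images)

# G: maps input noise z to a fake sample x = G(z)
G = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, data_dim), nn.Tanh(),
)

# D: maps a sample x to a probability D(x) that x came from the data
D = nn.Sequential(
    nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

z = torch.randn(16, latent_dim)   # input noise
x_fake = G(z)                     # samples from the model
p_real = D(x_fake)                # D pushes this toward 0, G pushes it toward 1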

1.4 Generator networks

The generator is trained to map an input z to a sample x:

x = G(z; θ^(G))

  • G must be differentiable
  • Trainable for any size of z
  • Some theoretical guarantees require z to have higher dimension than x

The z are latent variables; they produce the observables x.

2 How generative models work

2.1 Likelihood

The likelihood is the probability that the model pmodel, with parameters θ, assigns to the training examples.

If the model distribution matches the data, the likelihood is large and the model is a good fit.

\text{Likelihood} = \prod_{i=1}^{m} p_{model}(\mathbf{x}^{(i)}; \theta)

We can also work in log space. The goal is to maximize the likelihood by adjusting the parameters θ (or by changing the chosen model).
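In log space, the maximum likelihood estimate of the parameters is

\theta^* = \arg\max_{\theta} \sum_{i=1}^{m} \log p_{model}(\mathbf{x}^{(i)}; \theta)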

Another way to view the same thing is to minimize the KL divergence between the empirical data distribution and pmodel:


D_{KL}(p_{data}(\mathbf{x}) \,\|\, p_{model}(\mathbf{x}; \theta))
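For reference, the KL divergence is defined as

D_{KL}(p \,\|\, q) = \mathbb{E}_{\mathbf{x} \sim p} [\log p(\mathbf{x}) - \log q(\mathbf{x})]

When pdata is taken to be the empirical distribution over the training set, minimizing this divergence with respect to θ is equivalent to maximizing the likelihood above.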

2.2 Training

Both networks are trained together, each taking gradient steps on its own cost (simultaneous SGD); training aims to find a Nash equilibrium of the game (a training sketch is given at the end of section 2.4).

2.3 Composition

Generator: creates the samples, generalizing from the original data (but never outputting the originals themselves).

G(z; θ^(G))

Discriminator: examines samples and decides whether they are real or fake. It is trained by supervised learning with labels {0, 1} (1 = real, 0 = fake): D(x; θ^(D))

Each player tries to minimize its own cost: the discriminator minimizes J^(D)(θ^(D), θ^(G)) and the generator minimizes J^(G)(θ^(D), θ^(G)).

The Nash equilibrium is a tuple (θ^(D), θ^(G)) that is simultaneously a local minimum of J^(D) with respect to θ^(D) and a local minimum of J^(G) with respect to θ^(G).

2.4 Cost functions J

For the discriminator, the cost is the standard cross-entropy, trained on two minibatches: one of real data (label 1) and one of generated samples (label 0):

J^{(D)}(\theta^{(D)}, \theta^{(G)}) = - \frac{1}{2} \mathbb{E}_{x \sim p_{data}} \log D(\mathbf{x}) - \frac{1}{2} \mathbb{E}_z \log (1 - D(G(z)))

Training D with this cost yields an estimate of the ratio p_data(x) / p_model(x) at every point x; the optimal discriminator is D*(x) = p_data(x) / (p_data(x) + p_model(x)).
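In code, J^(D) is just binary cross-entropy averaged over the two halves of the minibatch; a minimal sketch (assuming a discriminator D that outputs probabilities, and batches x_real, x_fake):

import torch
import torch.nn.functional as F

def discriminator_cost(D, x_real, x_fake):
    # targets: 1 for real data, 0 for generated samples
    ones = torch.ones(x_real.size(0), 1)
    zeros = torch.zeros(x_fake.size(0), 1)
    return 0.5 * F.binary_cross_entropy(D(x_real), ones) \
         + 0.5 * F.binary_cross_entropy(D(x_fake), zeros)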


If we make the game zero-sum (a minimax game), then J^(G) = -J^(D). The value function V(θ^(D), θ^(G)) = -J^(D)(θ^(D), θ^(G)) summarizes the whole game:

\theta^{(G)*} = \arg\min_{\theta^{(G)}} \max_{\theta^{(D)}} V(\theta^{(D)}, \theta^{(G)})

  • The discriminator tries to maximize its ability to tell real from fake, i.e. to decrease its cost J^(D), which increases V.
  • The generator tries to minimize V.
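Written out using the discriminator cost above, the value function is

V(\theta^{(D)}, \theta^{(G)}) = \frac{1}{2} \mathbb{E}_{x \sim p_{data}} \log D(\mathbf{x}) + \frac{1}{2} \mathbb{E}_z \log (1 - D(G(z)))

Only the second term depends on the generator, so in the minimax game the generator effectively minimizes E_z log(1 - D(G(z))).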


Problem with the minimax game: early in training the generator's samples are obviously different from the real data, so the discriminator wins by rejecting them with high confidence. In that regime log(1 - D(G(z))) saturates, the generator's gradient vanishes, and the generator stops learning. The heuristic fix is to give the generator its own, non-saturating cost:

J^{(G)}(\theta^{(D)}, \theta^{(G)}) = - \frac{1}{2} \mathbb{E}_z \log D(G(z))

Instead of flipping the sign of the discriminator's cost, we flip the target used to build the cross-entropy: the generator maximizes the log-probability of the discriminator being mistaken, which keeps a strong gradient even when D confidently rejects the samples.
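Putting the two costs together, here is a minimal non-saturating training loop sketch in PyTorch; the architectures, optimizer settings and the toy data source are illustrative assumptions only.

import torch
import torch.nn as nn

latent_dim, data_dim = 64, 2
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1), nn.Sigmoid())
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)

def sample_data(n):
    # stand-in for samples from p_data (here: an offset 2-D Gaussian)
    return torch.randn(n, data_dim) * 0.5 + 2.0

for step in range(1000):
    # discriminator step: minimize J^(D)
    x_real = sample_data(64)
    x_fake = G(torch.randn(64, latent_dim)).detach()
    j_d = -0.5 * torch.log(D(x_real) + 1e-8).mean() \
          - 0.5 * torch.log(1 - D(x_fake) + 1e-8).mean()
    opt_D.zero_grad(); j_d.backward(); opt_D.step()

    # generator step: minimize the non-saturating cost J^(G) = -1/2 E_z log D(G(z))
    x_fake = G(torch.randn(64, latent_dim))
    j_g = -0.5 * torch.log(D(x_fake) + 1e-8).mean()
    opt_G.zero_grad(); j_g.backward(); opt_G.step()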


Reverse KL: D_KL(p_model ∥ p_data), as opposed to the maximum-likelihood direction D_KL(p_data ∥ p_model). The forward direction spreads probability mass over all modes of the data; the reverse direction prefers generating only samples that look real, even if some modes are dropped.

TODO IDEA: learning triangle. Data arrives with noise. Two adversarial networks run in parallel, each trying to learn about the data and to discriminate real from fake samples...

3 How GANs are better than previous models

4 Tips and Tricks

4.1 Train with labels

Using labels lets the model separate the distribution into classes (e.g. knowing whether a sample is a dog or a house keeps the model from blending the two) and improves the subjective quality of the samples. One common way to do this is sketched below.
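A conditional-GAN-style sketch (an illustrative assumption, not the only way to use labels): concatenate a one-hot class label to the inputs of both G and D.

import torch
import torch.nn as nn

n_classes, latent_dim, data_dim = 10, 64, 784
G = nn.Sequential(nn.Linear(latent_dim + n_classes, 256), nn.ReLU(),
                  nn.Linear(256, data_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(data_dim + n_classes, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())

y = torch.nn.functional.one_hot(torch.randint(0, n_classes, (16,)), n_classes).float()
z = torch.randn(16, latent_dim)
x_fake = G(torch.cat([z, y], dim=1))        # generate a sample conditioned on class y
p_real = D(torch.cat([x_fake, y], dim=1))   # discriminate, knowing the class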

Problem of mode collapse (the "Helvetica scenario"): the generator maps many different z values to the same output, focusing on only one (or a few) modes of the data.

5 Details of GANs

6 Research on GANs

7 Sources

ArXiv Slides NIPS