1 Why generative modeling works

1.1 Vocabulary and notation

D Discriminator

G Generator

1.2 Generative models

A generative model takes training examples drawn from a distribution pdata and tries to find the parameters of a model of that distribution.

pdata generates the examples x, but pdata itself is unknown. We try to approximate it with pmodel.

Estimate

p_{model}(\mathbf{x})

given the examples we can draw from pdata
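
A toy numpy sketch of this idea, assuming pdata is an unknown one-dimensional distribution and choosing a Gaussian family for pmodel (both the stand-in data and the Gaussian choice are assumptions for illustration):

import numpy as np

# Pretend pdata is unknown: we only ever see samples drawn from it.
rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=0.5, size=1000)  # stand-in for "examples from pdata"

# Choose a model family pmodel(x; theta) = N(mu, sigma^2) and fit theta to the samples.
mu_hat = samples.mean()    # estimated mean
sigma_hat = samples.std()  # estimated standard deviation

def p_model(x):
    # Density of the fitted model pmodel(x; theta_hat).
    return np.exp(-0.5 * ((x - mu_hat) / sigma_hat) ** 2) / (sigma_hat * np.sqrt(2 * np.pi))

print(mu_hat, sigma_hat, p_model(2.0))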

Why should we care about generative models?

  • Test our ability to work with high-dimensional probability distributions
  • Can be incorporated into RL: for time series, a generative model can simulate possible futures and provide more training examples; the same holds for inverse RL (IRL)
  • Can be trained with missing data
  • Make ML work with multi-modal outputs
  • Realistic sample generation

1.3 Adversarial Networks

2 networks

First network (the discriminator):

  • x is sampled from the data
  • D is a differentiable function
  • D tries to make D(x) near 1

Second network (the generator):

  • Takes input noise z
  • G is a differentiable function
  • x = G(z) is a sample from the model
  • That sample is fed to D
  • D tries to make D(G(z)) near 0, while G tries to make D(G(z)) near 1 (see the sketch below)
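
A minimal PyTorch sketch of the two players; the toy 2-D data, the noise dimension, and the small MLPs are assumptions made for illustration:

import torch
import torch.nn as nn

data_dim, noise_dim = 2, 2

# Discriminator D: differentiable, outputs the probability that its input is real.
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

# Generator G: differentiable, maps noise z to a point in data space.
G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))

x = torch.randn(16, data_dim)   # stand-in for a minibatch sampled from the data
z = torch.randn(16, noise_dim)  # input noise

d_real = D(x)     # D wants this near 1
d_fake = D(G(z))  # D wants this near 0, G wants it near 1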

TODO : see again

1.4 Generator networks

Takes z as input and generates x

x = G(z; θ(G))

  • Differentiable
  • Trainable for any size of z
  • z should have dimension at least as large as x (so that pmodel can have full support over x space)

z are latent variables; x are the observable variables produced from them
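
As a concrete (assumed) instance of x = G(z; θ(G)): a small PyTorch module whose parameters play the role of θ(G); sampling the latent z and pushing it through G produces observable samples x. The dimensions below are illustrative only:

import torch
import torch.nn as nn

class Generator(nn.Module):
    # x = G(z; theta_G): maps latent variables z to observable samples x.
    def __init__(self, z_dim=4, x_dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, 32), nn.ReLU(), nn.Linear(32, x_dim))

    def forward(self, z):
        return self.net(z)

G = Generator()
theta_G = list(G.parameters())  # the trainable parameters theta^(G)
z = torch.randn(8, 4)           # latent variables, drawn from a simple prior
x = G(z)                        # observable samples produced by the generator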

2 How generative models work

2.1 Likelihood

The likelihood is the probability that the model pmodel, with parameters θ, assigns to the training examples.

If the model distribution matches the examples well, the likelihood is high and the model is a good fit.

Likelihood = \prod_{i=1}^{m} p_{model}(\mathbf{x}^{(i)}; \theta)

We can also work in log space. The goal is to maximize the likelihood by adjusting the parameters θ (or by changing the chosen model).

Another way to do this is to minimize the KL divergence of pmodel against the empirical distribution

\hat{p}_{data}

D_{KL}(\hat{p}_{data}(\mathbf{x}) \| p_{model}(\mathbf{x}; \theta))
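
A quick check that the two views agree, writing the expectation under the empirical distribution as an average over the m training examples:

D_{KL}(\hat{p}_{data} \| p_{model}) = \mathbb{E}_{\mathbf{x} \sim \hat{p}_{data}} \log \hat{p}_{data}(\mathbf{x}) - \frac{1}{m} \sum_{i=1}^{m} \log p_{model}(\mathbf{x}^{(i)}; \theta)

The first term does not depend on θ, so minimizing this KL divergence over θ is the same as maximizing the log-likelihood.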

2.2 Training

Train them by trying to find a Nash equilibrium

2.3 Composition

Generator Creates samples by generalizing from the original data (but without reproducing the original examples)

G(z; θ(G))

Discriminator Examines samples in order to say whether they are real or fake. Learns with supervised learning, with labels {0, 1} (0 = fake, 1 = real): D(x; θ(D))

Each player tries to minimize its own cost: the discriminator minimizes J(D)(θ(D), θ(G)) by controlling only θ(D), and the generator minimizes J(G)(θ(D), θ(G)) by controlling only θ(G).

The Nash equilibrium is a tuple (θ(D), θ(G)) that is a local minimum of J(D) with respect to θ(D) and a local minimum of J(G) with respect to θ(G)

2.4 Cost functions J

For the discriminator: standard cross entropy, but trained on two minibatches (one real, one generated):


J^{(D)}(\theta^{(D)}, \theta^{(G)}) = - \frac{1}{2} \mathbb{E}_{x \sim p_{data}} \log D(\mathbf{x}) - \frac{1}{2} \mathbb{E}_z \log (1 - D(G(z)))

It gives an estimate of the ratio

\frac{p_{data}(\mathbf{x})}{p_{model}(\mathbf{x})}
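
This follows from the optimal discriminator for the cost above (assuming D can be chosen freely at every point x):

D^*(\mathbf{x}) = \frac{p_{data}(\mathbf{x})}{p_{data}(\mathbf{x}) + p_{model}(\mathbf{x})}

so the ratio is recovered as D^*(\mathbf{x}) / (1 - D^*(\mathbf{x})).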

If we set this up as a zero-sum game (minimax), then J(D) = −J(G). The value function V(θ(D), θ(G)) = −J(D)(θ(D), θ(G)) summarizes the game.

\theta^{(G)*} = \arg\min_{\theta^{(G)}} \max_{\theta^{(D)}} V(\theta^{(D)}, \theta^{(G)})

  • The discriminator tries to maximize its ability to tell real from fake, i.e. to decrease its cost J(D), which increases V.
  • The generator tries to minimize V.

Problem:

If the generator simply minimizes this same quantity, its cost saturates: while the generated samples are still very different from the data, the discriminator can reject them with high confidence, log(1 − D(G(z))) flattens out, and the generator gets almost no gradient to learn from.


J^{(G)}(\theta^{(D)}, \theta^{(G)}) = - \frac{1}{2} \mathbb{E}_z \log D(G(z))

This keeps the cross-entropy form but flips the target instead of flipping the sign of the discriminator's cost, so the generator still gets a strong gradient when the discriminator confidently rejects its samples.
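
A minimal sketch of how the two costs could drive an alternating training loop in PyTorch; the architectures, optimizer settings, and the shifted-Gaussian stand-in for pdata are all assumptions for illustration:

import torch
import torch.nn as nn

data_dim, noise_dim, batch = 2, 2, 64
# D outputs logits here; the sigmoid is folded into the loss below.
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))
G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()  # cross entropy on logits

for step in range(1000):
    x = torch.randn(batch, data_dim) + 2.0  # stand-in for a minibatch from pdata
    z = torch.randn(batch, noise_dim)

    # Discriminator step: J^(D) = -1/2 E log D(x) - 1/2 E log(1 - D(G(z)))
    d_loss = (0.5 * bce(D(x), torch.ones(batch, 1))
              + 0.5 * bce(D(G(z).detach()), torch.zeros(batch, 1)))  # detach: D step must not update G
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # Generator step (non-saturating): J^(G) = -1/2 E log D(G(z))
    z = torch.randn(batch, noise_dim)
    g_loss = 0.5 * bce(D(G(z)), torch.ones(batch, 1))
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()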

KL: maximum likelihood corresponds to minimizing DKL(pdata ∥ pmodel), which pushes the model to cover all modes of the data.

Reverse KL: minimizing DKL(pmodel ∥ pdata) instead tends to concentrate the model on a subset of the modes.

TODO IDEA Learning triangle: there is data coming in with noise. There are two adversarial networks in parallel, each trying to learn about the data and to discriminate real from fake samples...

3 How GANs are better than previous models

4 Tips and Tricks

4.1 Train with labels

Labels allow the model to separate the distribution by class (e.g., knowing whether a sample is a dog or a house prevents the two from being mixed together).
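
One common way to feed label information to the networks (a sketch under assumed dimensions, with a hypothetical 3-class setup): concatenate a one-hot class label to the noise z on the generator side, and do the analogous thing for the discriminator:

import torch
import torch.nn as nn
import torch.nn.functional as F

noise_dim, n_classes, data_dim = 4, 3, 2

# Class-conditional generator: its input is the concatenation [z, one_hot(y)].
G = nn.Sequential(nn.Linear(noise_dim + n_classes, 64), nn.ReLU(), nn.Linear(64, data_dim))

z = torch.randn(8, noise_dim)
y = torch.randint(0, n_classes, (8,))        # class labels, e.g. dog / house / ...
y_onehot = F.one_hot(y, n_classes).float()
x_fake = G(torch.cat([z, y_onehot], dim=1))  # samples conditioned on the label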

Problem of mode collapse (the Helvetica scenario): the generator focuses on only one (or a few) outputs.

5 Details of GAN

6 Research of GAN

7 Source

  • ArXiv
  • NIPS slides