1 Why generative modeling works

1.1 Vocabulary and notation

D Discriminator

G Generator

1.2 Generative models

A generative model takes training examples drawn from a distribution pdata and tries to find the parameters of a model of that distribution.

pdata generates the examples x, but pdata itself is unknown. We try to approximate it with pmodel.

Estimate

p_{model}(\mathbf{x})

given the examples we can draw from pdata
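
A toy numpy sketch of this idea, assuming pdata is an unknown one-dimensional distribution and choosing a Gaussian family for pmodel (both the stand-in data and the Gaussian choice are assumptions for illustration):

import numpy as np

# Pretend pdata is unknown: we only ever see samples drawn from it.
rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=0.5, size=1000)  # stand-in for "examples from pdata"

# Choose a model family pmodel(x; theta) = N(mu, sigma^2) and fit theta to the samples.
mu_hat = samples.mean()    # estimated mean
sigma_hat = samples.std()  # estimated standard deviation

def p_model(x):
    # Density of the fitted model pmodel(x; theta_hat).
    return np.exp(-0.5 * ((x - mu_hat) / sigma_hat) ** 2) / (sigma_hat * np.sqrt(2 * np.pi))

print(mu_hat, sigma_hat, p_model(2.0))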

Why should we care about generative models?

  • Test our ability to work with high-dimensional probability distributions
  • Can be incorporated into RL: for time series, a generative model can simulate possible futures and provide more training examples; the same holds for inverse RL (IRL)
  • Can be trained with missing data
  • Make ML work with multi-modal outputs
  • Realistic sample generation

1.3 Adversarial Networks

2 networks

First network (the discriminator):

  • x is sampled from the data
  • D is a differentiable function
  • D tries to make D(x) near 1

Second network (the generator):

  • Takes input noise z
  • G is a differentiable function
  • x = G(z) is a sample from the model
  • That sample is fed to D
  • D tries to make D(G(z)) near 0, while G tries to make D(G(z)) near 1 (see the sketch below)
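
A minimal PyTorch sketch of the two players; the toy 2-D data, the noise dimension, and the small MLPs are assumptions made for illustration:

import torch
import torch.nn as nn

data_dim, noise_dim = 2, 2

# Discriminator D: differentiable, outputs the probability that its input is real.
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

# Generator G: differentiable, maps noise z to a point in data space.
G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))

x = torch.randn(16, data_dim)   # stand-in for a minibatch sampled from the data
z = torch.randn(16, noise_dim)  # input noise

d_real = D(x)     # D wants this near 1
d_fake = D(G(z))  # D wants this near 0, G wants it near 1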

TODO : see again

1.4 Generator networks

Takes z as input and generates x

x = G(z; θ(G))

  • Differentiable
  • Trainable for any size of z
  • z should have dimension at least as large as x (so that pmodel can have full support over x space)

z are latent variables; x are the observable variables produced from them
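
As a concrete (assumed) instance of x = G(z; θ(G)): a small PyTorch module whose parameters play the role of θ(G); sampling the latent z and pushing it through G produces observable samples x. The dimensions below are illustrative only:

import torch
import torch.nn as nn

class Generator(nn.Module):
    # x = G(z; theta_G): maps latent variables z to observable samples x.
    def __init__(self, z_dim=4, x_dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, 32), nn.ReLU(), nn.Linear(32, x_dim))

    def forward(self, z):
        return self.net(z)

G = Generator()
theta_G = list(G.parameters())  # the trainable parameters theta^(G)
z = torch.randn(8, 4)           # latent variables, drawn from a simple prior
x = G(z)                        # observable samples produced by the generator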

2 How generative models work

2.1 Likelihood

The likelihood is the probability that the model pmodel, with parameters θ, assigns to the training examples.

If the model distribution matches the examples well, the likelihood is high and the model is a good fit.

Likelihood = \prod_{i=1}^{m} p_{model}(\mathbf{x}^{(i)}; \theta)

We can also work in log space. The goal is to maximize the likelihood by adjusting the parameters θ (or by changing the chosen model).

Another way to do this is to minimize the KL divergence of pmodel against the empirical distribution

\hat{p}_{data}

D_{KL}(\hat{p}_{data}(\mathbf{x}) \| p_{model}(\mathbf{x}; \theta))
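
A quick check that the two views agree, writing the expectation under the empirical distribution as an average over the m training examples:

D_{KL}(\hat{p}_{data} \| p_{model}) = \mathbb{E}_{\mathbf{x} \sim \hat{p}_{data}} \log \hat{p}_{data}(\mathbf{x}) - \frac{1}{m} \sum_{i=1}^{m} \log p_{model}(\mathbf{x}^{(i)}; \theta)

The first term does not depend on θ, so minimizing this KL divergence over θ is the same as maximizing the log-likelihood.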

2.2 Training

Train them by trying to find a Nash equilibrium

2.3 Composition

Generator Creates samples by generalizing from the original data (but without reproducing the original examples)

G(z; θ(G))

Discriminator Examines samples in order to say whether they are real or fake. Learns with supervised learning, with labels {0, 1} (0 = fake, 1 = real): D(x; θ(D))

Each player tries to minimize its own cost: the discriminator minimizes J(D)(θ(D), θ(G)) by controlling only θ(D), and the generator minimizes J(G)(θ(D), θ(G)) by controlling only θ(G).

The Nash equilibrium is a tuple (θ(D), θ(G)) that is a local minimum of J(D) with respect to θ(D) and a local minimum of J(G) with respect to θ(G)

2.4 Cost functions J

For the discriminator: standard cross entropy, but trained on two minibatches (one real, one generated):


J^{(D)}(\theta^{(D)}, \theta^{(G)}) = - \frac{1}{2} \mathbb{E}_{x \sim p_{data}} \log D(\mathbf{x}) - \frac{1}{2} \mathbb{E}_z \log (1 - D(G(z)))

It gives an estimate of the ratio

\frac{p_{data}(\mathbf{x})}{p_{model}(\mathbf{x})}
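
This follows from the optimal discriminator for the cost above (assuming D can be chosen freely at every point x):

D^*(\mathbf{x}) = \frac{p_{data}(\mathbf{x})}{p_{data}(\mathbf{x}) + p_{model}(\mathbf{x})}

so the ratio is recovered as D^*(\mathbf{x}) / (1 - D^*(\mathbf{x})).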

If we set this up as a zero-sum game (minimax), then J(D) = −J(G). The value function V(θ(D), θ(G)) = −J(D)(θ(D), θ(G)) summarizes the game.

\theta^{(G)*} = \arg\min_{\theta^{(G)}} \max_{\theta^{(D)}} V(\theta^{(D)}, \theta^{(G)})

  • The discriminator tries to maximize its ability to tell real from fake, i.e. to decrease its cost J(D), which increases V.
  • The generator tries to minimize V.

Problem:

If the generator simply minimizes this same quantity, its cost saturates: while the generated samples are still very different from the data, the discriminator can reject them with high confidence, log(1 − D(G(z))) flattens out, and the generator gets almost no gradient to learn from.


J^{(G)}(\theta^{(D)}, \theta^{(G)}) = - \frac{1}{2} \mathbb{E}_z \log D(G(z))

This keeps the cross-entropy form but flips the target instead of flipping the sign of the discriminator's cost, so the generator still gets a strong gradient when the discriminator confidently rejects its samples.
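
A minimal sketch of how the two costs could drive an alternating training loop in PyTorch; the architectures, optimizer settings, and the shifted-Gaussian stand-in for pdata are all assumptions for illustration:

import torch
import torch.nn as nn

data_dim, noise_dim, batch = 2, 2, 64
# D outputs logits here; the sigmoid is folded into the loss below.
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))
G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()  # cross entropy on logits

for step in range(1000):
    x = torch.randn(batch, data_dim) + 2.0  # stand-in for a minibatch from pdata
    z = torch.randn(batch, noise_dim)

    # Discriminator step: J^(D) = -1/2 E log D(x) - 1/2 E log(1 - D(G(z)))
    d_loss = (0.5 * bce(D(x), torch.ones(batch, 1))
              + 0.5 * bce(D(G(z).detach()), torch.zeros(batch, 1)))  # detach: D step must not update G
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # Generator step (non-saturating): J^(G) = -1/2 E log D(G(z))
    z = torch.randn(batch, noise_dim)
    g_loss = 0.5 * bce(D(G(z)), torch.ones(batch, 1))
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()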

KL: maximum likelihood corresponds to minimizing DKL(pdata ∥ pmodel), which pushes the model to cover all modes of the data.

Reverse KL: minimizing DKL(pmodel ∥ pdata) instead tends to concentrate the model on a subset of the modes.

TODO IDEA Learning triangle: there is data coming in with noise. There are two adversarial networks in parallel, each trying to learn about the data and to discriminate real from fake samples...

3 How GANs are better than previous models

4 Tips and Tricks

4.1 Train with labels

Labels allow the model to separate the distribution by class (e.g., knowing whether a sample is a dog or a house prevents the two from being mixed together).
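
One common way to feed label information to the networks (a sketch under assumed dimensions, with a hypothetical 3-class setup): concatenate a one-hot class label to the noise z on the generator side, and do the analogous thing for the discriminator:

import torch
import torch.nn as nn
import torch.nn.functional as F

noise_dim, n_classes, data_dim = 4, 3, 2

# Class-conditional generator: its input is the concatenation [z, one_hot(y)].
G = nn.Sequential(nn.Linear(noise_dim + n_classes, 64), nn.ReLU(), nn.Linear(64, data_dim))

z = torch.randn(8, noise_dim)
y = torch.randint(0, n_classes, (8,))        # class labels, e.g. dog / house / ...
y_onehot = F.one_hot(y, n_classes).float()
x_fake = G(torch.cat([z, y_onehot], dim=1))  # samples conditioned on the label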

Problem of mode collapse (the Helvetica scenario): the generator focuses on only one (or a few) outputs.

5 Details of GAN

6 Research of GAN

7 Source

  • ArXiv
  • NIPS slides