# 1 Why generative modeling works

Notation: D is the discriminator network, G the generator network.

## 1.2 Generative models

A generative model takes training examples and finds the parameters of a distribution.

The examples x are drawn from p_data, which is unknown. We try to approximate it with a model distribution p_model.

We estimate

p_{model}(\mathbf{x})

given the examples we can draw from p_data.

Why should we care about generative models?

• Tests our ability to work with high-dimensional probability distributions
• Can be incorporated into RL: for time series, a generative model can simulate the future and provide more examples. The same applies to inverse RL
• Can be trained with missing data
• Makes ML work with multi-modal outputs
• Generates realistic samples

Two networks play against each other.

First (discriminator):

• x is sampled from the data
• D is a differentiable function
• D tries to make D(x) near 1

Second (generator):

• Takes input noise z
• G is a differentiable function
• x = G(z) is a sample from the model
• D tries to make D(G(z)) near 0; G tries to make D(G(z)) near 1

TODO : see again
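The interplay above can be sketched numerically. A minimal sketch, assuming hypothetical one-parameter "networks" (a logistic discriminator and a linear generator), just to make the targets concrete:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Hypothetical tiny "networks": D squashes a score into (0, 1), G maps noise to a sample.
def D(x, theta_d):
    return sigmoid(theta_d * x)

def G(z, theta_g):
    return theta_g * z

rng = np.random.default_rng(0)
x_real = rng.normal(2.0, 1.0, size=1000)   # samples from p_data
z = rng.normal(0.0, 1.0, size=1000)        # input noise

theta_d, theta_g = 1.0, 0.1
d_real = D(x_real, theta_d)                # D wants this near 1
d_fake = D(G(z, theta_g), theta_d)         # D wants this near 0, G wants it near 1

# Discriminator cross-entropy over the two halves (see section 2.4)
loss_d = -0.5 * np.mean(np.log(d_real)) - 0.5 * np.mean(np.log(1.0 - d_fake))
print(loss_d)
```

With untrained parameters D already scores real samples higher than fakes on average; training would push the two players' parameters against each other.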

## 1.4 Generator networks

Trained to map noise z to generated samples x:

x = G(z; θ(G))

• G must be differentiable
• Trainable for any dimension of z
• z must have dimension at least as large as that of x

The z are latent variables; the x they produce are the observed variables.
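A minimal sketch of such a generator network, assuming a hypothetical two-layer MLP with random (untrained) weights; training would adjust θ(G) = (W1, b1, W2, b2):

```python
import numpy as np

rng = np.random.default_rng(0)

# Latent z (dim 8) -> observable x (dim 2); z_dim >= x_dim as noted above.
z_dim, h_dim, x_dim = 8, 16, 2
W1 = rng.normal(0, 0.1, (z_dim, h_dim)); b1 = np.zeros(h_dim)
W2 = rng.normal(0, 0.1, (h_dim, x_dim)); b2 = np.zeros(x_dim)

def G(z):
    h = np.tanh(z @ W1 + b1)       # differentiable nonlinearity
    return h @ W2 + b2             # x = G(z; theta_G)

z = rng.normal(size=(100, z_dim))  # latent variables from a simple prior p(z)
x = G(z)                           # observable samples from p_model
print(x.shape)                     # (100, 2)
```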

# 2 How generative models work

## 2.1 Likelihood

The likelihood evaluates the model p_model with its parameters θ on the training examples.

If the examples match the distribution, the likelihood is large and the model is a good fit.

Likelihood = \prod_{i=1}^{m} p_{model}(\mathbf{x}^{(i)}; \theta)

We can also work in log space. The goal is to maximize the likelihood by adjusting the parameters θ (or by changing the chosen model).

Another way is to minimize the KL divergence between the empirical distribution

\hat{p}_{data}

and the model:

D_{KL}(\hat{p}_{data}(\mathbf{x}) \| p_{model}(\mathbf{x}; \theta))
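As a worked example under the assumption of a simple Gaussian model family, maximum likelihood has a closed form, and its optimum coincides with minimizing the KL divergence to the empirical distribution (the entropy of \hat{p}_{data} does not depend on θ):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(3.0, 2.0, size=10000)   # examples from an "unknown" p_data

# Model family: Gaussian p_model(x; mu, sigma). The MLE is the sample mean
# and (population) standard deviation.
mu_hat = x.mean()
sigma_hat = x.std()

def log_lik(mu, sigma):
    # sum_i log p_model(x_i; mu, sigma)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

# The MLE beats nearby parameter settings:
print(log_lik(mu_hat, sigma_hat), log_lik(mu_hat + 0.5, sigma_hat))
```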

## 2.2 Training

Both networks are trained simultaneously; training seeks a Nash equilibrium of the game between them.

## 2.3 Composition

Generator Creates the samples by generalizing from the original data (without reproducing it): G(z; θ(G))

Discriminator Examines samples to decide whether they are real or fake. It learns by supervised learning with labels {0, 1} (real or fake): D(x; θ(D))

Each player tries to minimize its own cost: J(D)(θ(D), θ(G)) for the discriminator, J(G)(θ(D), θ(G)) for the generator.

The Nash equilibrium is a tuple (θ(D), θ(G)) that is a local minimum of J(D) with respect to θ(D) and a local minimum of J(G) with respect to θ(G).
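The equilibrium condition can be checked on a toy zero-sum game (an illustration only, not a GAN): with value V(u, v) = u·v, the pair (0, 0) is a Nash equilibrium because neither player can lower its own cost by a unilateral deviation.

```python
# Toy zero-sum game: V(u, v) = u * v.
# The "discriminator" player controls u and maximizes V, so its cost is J_D = -V;
# the "generator" player controls v and minimizes V, so its cost is J_G = V.
def J_D(u, v): return -u * v
def J_G(u, v): return u * v

u_star, v_star = 0.0, 0.0   # candidate equilibrium

# Nash condition: no unilateral deviation lowers a player's own cost.
devs = [i / 20.0 - 1.0 for i in range(41)]
d_ok = all(J_D(u, v_star) >= J_D(u_star, v_star) for u in devs)
g_ok = all(J_G(u_star, v) >= J_G(u_star, v_star) for v in devs)
print(d_ok, g_ok)   # True True
```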

## 2.4 Cost functions J

For the discriminator, Cross entropy but trained on 2 subsets (real and generated):

J^{(D)}(\theta^{(D)}, \theta^{(G)}) = - \frac{1}{2} \mathbb{E}_{x \sim p_{data}} \log D(\mathbf{x}) - \frac{1}{2} \mathbb{E}_z \log (1 - D(G(z)))

At its optimum, the discriminator gives an estimate of the ratio

\frac{p_{data}(\mathbf{x})}{p_{model}(\mathbf{x})}
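This ratio estimate can be checked pointwise. A sketch using two hypothetical Gaussian densities as stand-ins for p_data and p_model: the cross-entropy cost is minimized pointwise by D*(x) = p_data(x) / (p_data(x) + p_model(x)), and D*/(1 − D*) recovers the ratio exactly.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

x = np.linspace(-3, 5, 9)
p_data  = gauss_pdf(x, 1.0, 1.0)   # stand-in for the data density
p_model = gauss_pdf(x, 0.0, 1.5)   # stand-in for the model density

# Pointwise optimal discriminator and the implied density ratio:
d_star = p_data / (p_data + p_model)
ratio  = d_star / (1.0 - d_star)
print(np.allclose(ratio, p_data / p_model))   # True
```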

We can make it a zero-sum, or minimax, game by setting J(G) = −J(D). The value function V(θ(D), θ(G)) = −J(D)(θ(D), θ(G)) summarizes the game.

\theta^{(G)*} = \arg \min_{\theta^{(G)}} \max_{\theta^{(D)}} V(\theta^{(D)}, \theta^{(G)})

• The discriminator tries to maximize its ability to tell real from fake, i.e. to decrease its cost J(D), which increases V.
• The generator tries to minimize the same quantity V.

Problem:

If the generator directly minimizes this minimax quantity, its gradient vanishes whenever the discriminator confidently rejects its samples: early in training the discriminator wins easily, log(1 − D(G(z))) saturates, and the generator stops learning. A heuristic fix is the non-saturating generator cost:

J^{(G)}(\theta^{(D)}, \theta^{(G)}) = - \frac{1}{2} \mathbb{E}_z \log D(G(z))

Instead of flipping the sign of the discriminator's cost, this flips the target: the generator still receives a strong gradient when the discriminator rejects its samples.
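The difference in gradient signal can be seen numerically. A sketch, assuming D(G(z)) = σ(s) for a logit s the generator can influence: when the discriminator confidently rejects a sample (s very negative), the minimax cost's gradient vanishes while the heuristic cost's gradient stays near −1.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Early in training D rejects G's samples: D(G(z)) = sigmoid(s) with s very negative.
s = -8.0
d = sigmoid(s)                 # ~ 3e-4

# Gradient of each generator cost with respect to the logit s:
grad_minimax   = -d            # d/ds [ log(1 - sigmoid(s)) ] = -sigmoid(s)  -> saturates
grad_heuristic = -(1.0 - d)    # d/ds [ -log sigmoid(s) ] = -(1 - sigmoid(s)) -> stays ~ -1
print(grad_minimax, grad_heuristic)
```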

Maximum likelihood minimizes D_{KL}(p_data ∥ p_model); the GAN objectives behave more like the reverse KL, D_{KL}(p_model ∥ p_data), which favors sharp, plausible samples over covering every mode of the data.

TODO IDEA Learning triangle: data arrives with noise; two adversarial networks run in parallel, each trying to learn about the data and to discriminate real from fake samples...

# 4 Tips and Tricks

## 4.1 Train with labels

Labels help separate the class-conditional distributions (e.g., knowing whether a sample is a dog or a house prevents the model from blending the two).

Problem of mode collapse (the "Helvetica scenario"): the generator focuses on a single output only, mapping many different z values to the same x.