*D* **Discriminator**

*G* **Generator**

Takes examples to find the parameters of a distribution *p*_{data}.

*p*_{data} gives us *x* but is unknown; we try to approximate it by *p*_{model}.

Estimate

p_{model}(\mathbf{x})

given the examples we can get from *p*_{data}.

Why should we care about generative models?

- Test the ability to work with *n*-dimensional probability distributions
- Incorporate into RL: for time series, a generative model can generate the future and provide more examples; the same holds for IRL
- Can be trained with missing data
- Make ML work with multi-modal outputs
- Realistic sample generation

2 networks

First (the discriminator):

- *x* sampled from the data
- Find a differentiable function *D*
- Try to make *D*(*x*) near 1

Second (the generator):

- Take input noise *z*
- Find a differentiable function *G*
- *x* = *G*(*z*) is sampled from the model
- *D* tries to make *D*(*G*(*z*)) near 0, *G* tries to make *D*(*G*(*z*)) near 1

TODO : see again

Trained on **z**, generates **x**:

**x** = *G*(**z**; *θ*^{(G)})

- Differentiable
- Trainable for any size of *z*
- *z* must have at least as high a dimension as *x*
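A minimal sketch of the generator as a differentiable function, using NumPy. The single affine-plus-tanh layer and its parameters are hypothetical stand-ins for illustration; a real generator is a deep network, but the essential property is only that **x** = *G*(**z**; *θ*^{(G)}) is differentiable in *θ*^{(G)}:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters theta_G: a single affine layer (illustration only;
# a real generator stacks many such layers).
W = rng.normal(size=(2, 2))  # weights
b = np.zeros(2)              # biases

def G(z, W, b):
    """Map latent noise z of shape (n, 2) to samples x of shape (n, 2)."""
    return np.tanh(z @ W.T + b)  # tanh keeps the map smooth and differentiable

z = rng.normal(size=(1000, 2))  # z ~ N(0, I); here dim(z) == dim(x)
x = G(z, W, b)
print(x.shape)  # (1000, 2)
```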

**z** are latent variables; they produce the observables **x**.

Likelihood is the evaluation of the model *p*_{model}, with its parameters *θ*, on the training examples.

If the examples match the distribution, the likelihood is high and the model is valid.

\text{Likelihood} = \prod_{i=1}^m p_{model}(\mathbf{x}^{(i)}; \theta)

We can also work in log space. The goal is to **maximize** the likelihood by changing the parameters *θ* (or by changing the chosen model).
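A sketch of maximum likelihood on a toy model (a 1-D Gaussian with fixed variance, assumed here purely for illustration): the log-likelihood is maximized at the sample mean, which beats any other choice of the parameter:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.0, size=1000)  # training examples from p_data

def log_likelihood(x, mu, sigma=1.0):
    """Sum of log p_model(x_i; mu) for a Gaussian model with fixed sigma."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu)**2 / (2 * sigma**2))

# Maximizing over mu: the maximum-likelihood estimate is the sample mean.
mu_mle = x.mean()
assert log_likelihood(x, mu_mle) >= log_likelihood(x, 0.0)
assert log_likelihood(x, mu_mle) >= log_likelihood(x, 5.0)
print(round(mu_mle, 2))  # close to the true mean 3.0
```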

Another way is to **minimize** the KL divergence of *p*_{model} against the empirical distribution

\hat{p}_{data}

D_{KL}(\hat{p}_{data}(\mathbf{x}) \| p_{model}(\mathbf{x}; \theta))
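The two views are the same optimization, since the entropy of the empirical distribution does not depend on *θ*:

D_{KL}(\hat{p}_{data} \| p_{model}) = \underbrace{\mathbb{E}_{\mathbf{x} \sim \hat{p}_{data}} \log \hat{p}_{data}(\mathbf{x})}_{\text{constant in } \theta} - \mathbb{E}_{\mathbf{x} \sim \hat{p}_{data}} \log p_{model}(\mathbf{x}; \theta)

so \arg\min_\theta D_{KL}(\hat{p}_{data} \| p_{model}) = \arg\max_\theta \mathbb{E}_{\mathbf{x} \sim \hat{p}_{data}} \log p_{model}(\mathbf{x}; \theta), which is the (scaled) log-likelihood.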

Train them by trying to find a Nash equilibrium

**Generator** Creates the samples by generalizing from the original data (but never outputs the original examples themselves)

*G*(**z**; *θ*^{(G)})

**Discriminator** Examines samples in order to say whether they are real or fake. Learns with supervised learning, with labels {0, 1} (real or fake): *D*(**x**; *θ*^{(D)})

Each network tries to minimize its own cost, *J*^{(D)}(*θ*^{(D)}, *θ*^{(G)}) and *J*^{(G)}(*θ*^{(D)}, *θ*^{(G)}).

The *Nash* equilibrium is a tuple (*θ*^{(D)}, *θ*^{(G)}) that is a local minimum of *J*^{(D)} with respect to *θ*^{(D)} and a local minimum of *J*^{(G)} with respect to *θ*^{(G)}.
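A toy numeric check of the equilibrium condition, using the hypothetical bilinear value V(θ^{(D)}, θ^{(G)}) = θ^{(D)} · θ^{(G)} (a stand-in, not an actual GAN cost): at (0, 0), neither player can lower its own cost by a unilateral change of its parameter.

```python
import numpy as np

# Toy zero-sum game: V(theta_D, theta_G) = theta_D * theta_G.
# J_D = -V (discriminator minimizes J_D, i.e. maximizes V),
# J_G = +V (generator minimizes V).
def V(theta_D, theta_G):
    return theta_D * theta_G

eq = (0.0, 0.0)  # candidate Nash equilibrium

for dev in np.linspace(-2, 2, 41):
    # Discriminator deviates alone: its cost -V does not improve.
    assert -V(dev, eq[1]) >= -V(*eq)
    # Generator deviates alone: its cost V does not improve.
    assert V(eq[0], dev) >= V(*eq)
print("(0, 0) is a Nash equilibrium of the toy game")
```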

For the discriminator: cross entropy, but trained on 2 subsets (real and generated):

J^{(D)}(\theta^{(D)}, \theta^{(G)}) = - \frac{1}{2} \mathbb{E}_{x \sim p_{data}} \log D(\mathbf{x}) - \frac{1}{2} \mathbb{E}_z \log (1 - D(G(z)))
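A sketch of this cost on toy data, assuming a hypothetical 1-D logistic discriminator with hand-picked parameters (no training, just evaluating the expression above on samples):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical 1-D logistic discriminator D(x) = sigmoid(w*x + c).
def D(x, w=2.0, c=0.0):
    return sigmoid(w * x + c)

x_real = rng.normal(loc=2.0, size=1000)   # samples from p_data
x_fake = rng.normal(loc=-2.0, size=1000)  # samples x = G(z) from the model

# J_D = -1/2 E_x[log D(x)] - 1/2 E_z[log(1 - D(G(z)))]
J_D = (-0.5 * np.mean(np.log(D(x_real)))
       - 0.5 * np.mean(np.log(1.0 - D(x_fake))))
print(J_D)  # well below log(2), the chance-level value: this D separates well
```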

It gives an estimate of the ratio

\frac{p_{data}(\mathbf{x})}{p_{model}(\mathbf{x})}
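To see why: for a fixed generator, the optimal discriminator is D*(x) = p_data(x) / (p_data(x) + p_model(x)), so the ratio is recovered as D*/(1 − D*). A sketch with two known Gaussians standing in for p_data and p_model (chosen here only for illustration):

```python
import numpy as np

def gauss_pdf(x, mu, sigma=1.0):
    """Density of N(mu, sigma^2)."""
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

x = np.linspace(-3, 3, 7)
p_data = gauss_pdf(x, mu=1.0)    # stand-in for the data density
p_model = gauss_pdf(x, mu=-1.0)  # stand-in for the model density

# Optimal discriminator for a fixed generator:
D_star = p_data / (p_data + p_model)

# The density ratio falls out of the discriminator's output:
ratio = D_star / (1.0 - D_star)
assert np.allclose(ratio, p_data / p_model)
```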

If we set *J*^{(D)} = −*J*^{(G)}, it is a **zero-sum game**, or **minimax**. The **value function** *V*(*θ*^{(D)}, *θ*^{(G)}) = −*J*^{(D)}(*θ*^{(D)}, *θ*^{(G)}) summarizes the game.

\theta^{(G)*} = \arg\min_{\theta^{(G)}} \max_{\theta^{(D)}} V(\theta^{(D)}, \theta^{(G)})

- The discriminator tries to maximize its ability to detect real vs. fake, i.e. to decrease its cost *J*^{(D)}, thus increasing *V*.
- The generator tries to minimize this quantity.

Problem:

If the generator minimizes exactly this same quantity, the discriminator can win by confidently rejecting every generated sample: log(1 − *D*(*G*(**z**))) then saturates and the generator's gradient vanishes. A heuristic fix is to have the generator maximize the probability that the discriminator is wrong:

J^{(G)}(\theta^{(D)}, \theta^{(G)}) = - \frac{1}{2} \mathbb{E}_z \log D(G(z))

Instead of flipping the sign of the discriminator's cost, this flips its target, so the generator keeps a strong gradient even when the discriminator rejects its samples.
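A quick numeric sketch of the saturation, assuming a sigmoid discriminator D = σ(a) applied to a score *a* on a generated sample. Comparing the derivative of each generator cost with respect to *a* when the discriminator confidently rejects (D(G(z)) ≈ 0):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# a = discriminator's logit on a generated sample; very negative early in
# training, when D confidently rejects the generator's output.
a = -6.0
D = sigmoid(a)  # ~0.0025: confident rejection

# d/da log(1 - sigmoid(a)) = -sigmoid(a): minimax generator cost
grad_minimax = -D
# d/da [-log sigmoid(a)] = -(1 - sigmoid(a)): non-saturating heuristic cost
grad_heuristic = -(1.0 - D)

print(abs(grad_minimax), abs(grad_heuristic))
assert abs(grad_minimax) < 0.01    # vanishing gradient: learning stalls
assert abs(grad_heuristic) > 0.99  # strong learning signal remains
```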

**KL** *D*_{KL}(*p*_{data}∥*p*_{model}): the maximum-likelihood direction; tends to cover all modes of the data.

**Reverse KL** *D*_{KL}(*p*_{model}∥*p*_{data}): tends to concentrate on a single mode.
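A numeric sketch of the two behaviors on a made-up discrete example (the distributions below are chosen only for illustration): against a bimodal data distribution, the forward KL prefers a broad, mode-covering model, while the reverse KL prefers a sharp, single-mode model.

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence D_KL(p || q)."""
    p, q = np.asarray(p), np.asarray(q)
    return float(np.sum(p * np.log(p / q)))

# Bimodal data distribution over 4 bins (two modes at the ends).
p_data = [0.48, 0.02, 0.02, 0.48]
broad  = [0.25, 0.25, 0.25, 0.25]  # covers both modes (but blurry)
sharp  = [0.96, 0.02, 0.01, 0.01]  # concentrates on a single mode

# Forward KL (maximum likelihood) favors the mode-covering model:
assert kl(p_data, broad) < kl(p_data, sharp)
# Reverse KL favors the single-mode model:
assert kl(sharp, p_data) < kl(broad, p_data)
```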

TODO IDEA: Learning triangle. There is data coming with noise. There are two adversarial networks in parallel, each trying to learn about the data and to discriminate real from fake samples...

This allows separating the distributions (e.g. knowing whether a sample is a dog or a house avoids generating a mix of the two).

Problem of **mode collapse** (the **Helvetica scenario**): the generator focuses on one output only, instead of covering all the modes of the data distribution.