# [ Archived Post ] Week 1 CS294–158 Deep Unsupervised Learning (1/30/19)

Please note that this post is for my own educational purposes.

There are two approaches to unsupervised learning → generative modeling or self-supervised learning. (the labels are made up from the data itself in a very clever way).

Why should we do this → we need to get more out of the data → the cherry on the cake is RL → while the cake itself is unsupervised learning.

It is about finding patterns! → structure within the data. (generation, compression, better performance on downstream tasks, and much more).

GANs → were the start → a lot of promising signs, and now we are at the level where we can create really realistic images.

How each method generates super-resolution images. (as well as the zebra data). (BigGAN → we can create high-resolution images as well!) → some BigGAN results are very strange-looking images. (NVIDIA → is another large contributor to this field of generating images → as well as audio).

We can even create videos for robotics, text, and more! (strange-looking text and videos) → even generate LaTeX.

We measure via → bits per dimension → a way to measure compression. (half the number of bits needed) → there is lossless compression as well as lossy compression. (a downstream task for NLP → quickly do transfer learning) → increased performance. (vision tasks do a similar thing as well).

Do GANs → generate new images or just remember everything? → tricky to measure → not that easy to prove. (they generate a lot of variation). (a smooth interpolation suggests that the model is not memorizing).

The first family of deep generative models → starts from very simple models → and these fail in high dimensions.

We can create videos → applications include fake news or movies → to create new lectures and more! (Generative modeling problem → very nice practical problems → data compression is very related to likelihood-based models).

Compression → creates efficient codes → compressing well → requires a good generative model. (prediction → the better we predict, the less data we need to store → relates to this) → massive dataset → filter out some outliers → if something is outside the distribution of natural images, how should we detect this?

Likelihood-based → models a distribution over the data → given a dataset → x → fit a distribution. The P(x) → can be the probability of a certain image. (a distribution model → takes in data and outputs a number between 0 and 1). (everything will be about discrete data). We want this to be useful for actual compression in the real world. We can use this for outlier detection → if the probability is extremely low → that is an outlier. (sampling → some process that generates a random variable).

But the data is high-dimensional → how can we do well on this? → this is the key thing that sets deep learning methods apart from traditional statistical methods. (a lot of possible methods → there are a lot of trade-offs). (we want the models to train quickly and be small in size → yet, we want them to generalize). The practical side matters as well → usefulness in the downstream task.

The simplest method → is the histogram method. (very cool!). (this is a bad method but it is a start). The model is the number of samples in each bin divided by the total number of samples. (this is a likelihood method).

Form a histogram → the lecturer draws on the board → cannot see what is going on.

We know the probability → but can we sample it? → from a histogram → Wow, there is a method to do this → a traditional sampling method. (this is great → we have a likelihood generative model).
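
A minimal sketch of the histogram model and of sampling from it, written by me after the lecture (the board was not visible, so this is an assumption about what was drawn). I am guessing the "traditional sampling method" is inverse-CDF sampling; the function names and the toy dataset are made up for illustration.

```python
import random
from collections import Counter

def fit_histogram(samples, num_bins):
    """Histogram model: p[k] = (number of samples equal to k) / (total samples)."""
    counts = Counter(samples)
    n = len(samples)
    return [counts.get(k, 0) / n for k in range(num_bins)]

def sample_histogram(probs, rng):
    """Inverse-CDF sampling: draw u ~ U[0, 1) and return the first bin
    whose cumulative probability exceeds u."""
    u = rng.random()
    cumulative = 0.0
    last_nonzero = 0
    for k, p in enumerate(probs):
        cumulative += p
        if p > 0:
            last_nonzero = k
        if u < cumulative:
            return k
    return last_nonzero  # guard against floating-point round-off

# Toy dataset with values in {0, 1, 2, 3} (hypothetical).
data = [0, 1, 1, 2, 2, 2, 2, 3]
probs = fit_histogram(data, num_bins=4)

rng = random.Random(0)
draws = [sample_histogram(probs, rng) for _ in range(1000)]
```

So the same table gives both things we want from a likelihood-based generative model → a probability for any x (`probs[x]`) and new samples.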

This fails in high dimensions → so it is not really useful. (a huge number of histogram bins) → we never get any idea what the distribution looks like outside of the training data. (each training sample gets probability 1/dataset size) → this is useless → the model is the dataset → very poor generalization. (the number of parameters must be small → we need good compression → function approximation).
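
To get a feeling for how badly the histogram scales, here is a quick count. The numbers are my own assumption for illustration (binarized 28×28 images and a training set of 60,000 images, as in MNIST), not something from the slides.

```python
# One histogram bin per possible binary 28x28 image (assumed setup).
num_pixels = 28 * 28
num_bins = 2 ** num_pixels

# Assumed dataset size (MNIST training set).
dataset_size = 60_000

# At most `dataset_size` bins can be non-empty, so this is an upper
# bound on the fraction of bins the data can ever touch.
max_fraction_filled = dataset_size / num_bins

print(f"bins: a number with {len(str(num_bins))} digits")
print(f"fraction of bins that can be non-empty: {max_fraction_filled:.3e}")
```

Almost every bin stays empty → every unseen image gets probability zero → this is the "model is the dataset" problem.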

So learn the weights → this is called designing a model. (the problem is now finding the correct theta → or the optimal theta).

Is one distribution kinda equal to another? → design some distance function → that works well → we need to make a lot of choices and more. (not an easy problem).

We want these models to work on large datasets → and we want them to work well. (the theta must reflect the data distribution well).

Maximum likelihood → this works in practice → that is the reason why we use these models. (also, it measures how good the compression is).
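
Written out, the maximum likelihood objective looks like this. The notation is my own assumption (θ for the parameters, n for the dataset size, x⁽ⁱ⁾ for the i-th data point); the slides may use different symbols.

```latex
\theta^{*}
  = \arg\max_{\theta} \frac{1}{n} \sum_{i=1}^{n} \log p_{\theta}\big(x^{(i)}\big)
  = \arg\min_{\theta} \; -\frac{1}{n} \sum_{i=1}^{n} \log p_{\theta}\big(x^{(i)}\big)
```

The link to compression → an ideal code spends about $-\log_2 p_{\theta}(x)$ bits on x, so a lower average negative log-likelihood means shorter codes.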

Maximum likelihood does scale! → SGD can be applied here → to minimize/maximize an average over the data. (optimization method) → this works in practice.

Key property → the NN → must be able to compute the log probability as well as its gradient → we need to get both of these right. (DNN → this is what we are going to use → NNs are good for this → but hard to design). (the probabilities must sum up to one → very hard to deal with) → hard to design a NN that does this correctly. (more like a fundamental problem → energy-based models do not have this constraint → but then we have to deal with it in the training process and that is hard as well).
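
A tiny sketch of those two requirements for the simplest possible "network", just a vector of logits pushed through a softmax. This is my own toy example, not the lecture's code; the softmax is what makes the probabilities sum to one, and the gradient of log p(x) with respect to the logits is onehot(x) − softmax(logits). The dataset, learning rate, and step count are arbitrary choices.

```python
import math
import random

def softmax(logits):
    """Normalize logits into probabilities that sum to one."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_prob(logits, x):
    """log p_theta(x) for a discrete value x."""
    return math.log(softmax(logits)[x])

def grad_log_prob(logits, x):
    """Gradient of log p_theta(x) w.r.t. the logits: onehot(x) - softmax."""
    probs = softmax(logits)
    return [(1.0 if k == x else 0.0) - p for k, p in enumerate(probs)]

def train(data, num_values, steps=20000, lr=0.01, seed=0):
    """Stochastic gradient ascent on the average log-likelihood."""
    rng = random.Random(seed)
    theta = [0.0] * num_values
    for _ in range(steps):
        x = rng.choice(data)  # one random data point per step
        g = grad_log_prob(theta, x)
        theta = [t + lr * gk for t, gk in zip(theta, g)]
    return theta

data = [0, 1, 1, 2, 2, 2, 2, 3]  # toy dataset (hypothetical)
theta = train(data, num_values=4)
learned = softmax(theta)  # should be close to the empirical frequencies
```

With only four possible values this just relearns the histogram → the point is that the same recipe (log probability + gradient + SGD) keeps working when the logits come from a deep network.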

We reformulate the problem in terms of conditional probabilities → very good → NNs already do this well in image classification.

This is easy and doable → any distribution p can be decomposed into a product of conditionals. (again, drawing on the board → cannot see). (the joint distribution → can be modeled as a chain of conditional distributions). → these are called autoregressive models → each variable depends on the previous variables.

An autoregressive model with two random variables. (so the joint probability distribution has now been rewritten as a product of conditionals).
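
Since the board was not visible, here is the decomposition written out as I understand it (the chain rule of probability, with my own choice of symbols d for the number of variables and x_i for the i-th variable), followed by the two-variable case.

```latex
p(x_1, \dots, x_d) = \prod_{i=1}^{d} p(x_i \mid x_1, \dots, x_{i-1}),
\qquad
p(x_1, x_2) = p(x_1)\, p(x_2 \mid x_1)
```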

This scales… well, kinda. The number of parameters grows linearly. (text generation → a sentence can be very long → how long a sequence can our model handle? → another problem) → so many problems related to this.

RNN → fits perfectly into the autoregressive model → conditional probability → this idea has been applied to many different applications. (such as text and images).

Now we are going to talk about new methods → new as in about 4 years old.

Deep generative modeling → a hybrid of many different fields. (the stuff that we are going to talk about today → very recent methods).

Back to the lecture → very interesting property → adding more layers → does not help at all → we have these types of characteristics. (there are clever ways to restructure deep supervised learning into unsupervised methods).

RNN → suffers from vanishing gradients and is sequential, so it cannot compute fast enough. (we need to have some parallelism)

Woo → super interesting → an autoencoder with a mask → like dropout but deterministic. (super cool and interesting) → compute all the conditional distributions together. (ordering and more). (at the end of the network → we want to compute every conditional distribution → this can be done in one go!). (we want to mask the weight matrix).

What we have is an MLP with certain weights masked. (the reason why we do this → very efficient to optimize). (good for GPU). (mask the information flow → to make the conditional distributions valid). (MADE → why is it good for outlier detection? → we estimate the probability → and this number can help us → what is common vs what is not). (output nodes → scalars) → softmax. (generate a new sample? → we can inspect the model → to get a new x1 and x2 → can we start from x1? → no, we need to start from generating x2 → this is how we should do this.).
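
A minimal sketch of how such masks can be built for one hidden layer, following the degree rule from the MADE paper as I understand it (each unit gets a "degree", and a connection is kept only if it cannot leak information from later inputs). I assume the natural ordering x1, x2, … here, which may differ from the ordering used on the slide; the function names are my own.

```python
import random

def made_masks(num_inputs, num_hidden, seed=0):
    """Build binary masks for a one-hidden-layer MADE (needs num_inputs >= 2)."""
    rng = random.Random(seed)
    in_deg = list(range(1, num_inputs + 1))  # input d has degree d (natural ordering)
    hid_deg = [rng.randint(1, num_inputs - 1) for _ in range(num_hidden)]
    # Hidden unit k may see input j only if hid_deg[k] >= in_deg[j].
    mask_in = [[1 if hid_deg[k] >= in_deg[j] else 0 for j in range(num_inputs)]
               for k in range(num_hidden)]
    # Output d may see hidden unit k only if in_deg[d] > hid_deg[k] (strict).
    mask_out = [[1 if in_deg[d] > hid_deg[k] else 0 for k in range(num_hidden)]
                for d in range(num_inputs)]
    return mask_in, mask_out

def count_paths(mask_in, mask_out):
    """paths[d][j] = number of unmasked paths from input j to output d."""
    num_hidden = len(mask_in)
    num_inputs = len(mask_out)
    return [[sum(mask_out[d][k] * mask_in[k][j] for k in range(num_hidden))
             for j in range(num_inputs)]
            for d in range(num_inputs)]

mask_in, mask_out = made_masks(num_inputs=4, num_hidden=16)
paths = count_paths(mask_in, mask_out)
# Autoregressive property: output d has no path from input j whenever j >= d.
leaks = [(d, j) for d in range(4) for j in range(4) if j >= d and paths[d][j] > 0]
```

In a real network these masks are multiplied elementwise with the weight matrices → the layers stay ordinary dense layers, which is why it runs well on a GPU.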

To sample new data → we have to run many feedforward passes, one per dimension.
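
A sketch of why sampling is slow: each dimension needs its own forward pass, because the conditional for x_d is only valid once the earlier values are filled in. `toy_forward` is a hypothetical stand-in for a trained masked network (binary data, natural ordering assumed), not a real model.

```python
import math
import random

def toy_forward(x):
    """Hypothetical stand-in for a trained MADE on binary data.
    Returns P(x_d = 1 | x_<d) for every d in a single pass; entry d
    depends only on x[0..d-1], mimicking the autoregressive masks."""
    probs = []
    for d in range(len(x)):
        s = sum(x[:d]) - 0.5 * d
        probs.append(1.0 / (1.0 + math.exp(-s)))
    return probs

def sample(forward, num_dims, rng):
    """Ancestral sampling: fill in one dimension per forward pass."""
    x = [0] * num_dims
    num_passes = 0
    for d in range(num_dims):
        probs = forward(x)  # full pass, but only probs[d] is used
        num_passes += 1
        x[d] = 1 if rng.random() < probs[d] else 0
    return x, num_passes

x, num_passes = sample(toy_forward, num_dims=5, rng=random.Random(0))
```

Training needs one pass per data point (all conditionals at once), sampling needs `num_dims` passes per data point → that asymmetry is the price of the autoregressive structure.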

The implementation is easy as well → just apply the masks → and we are done.

Negative log-likelihood training results → how likely is the test data under the model → is it real or not. (the dataset is MNIST).

The generated samples → on the right → the nearest neighbor in the dataset → the model is able to generate novel samples from a limited amount of data.

Masked Temporal Convolution → this seems to be related to WaveNet. For each output node → we want it to be a conditional distribution. (an output node should not depend on the nodes that come after it → a regular convolution does not respect this).

The above slide is wrong → since there is no masking operation.

WaveNet → uses dilated convolutions → just insert zeros between the filter taps → the dilation rate sets the spacing → hence we are able to cover a larger range.

Wow → but an easier implementation can be seen above. (the padding explanation happens on the board → cannot see that).
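
Since the padding explanation was not visible, here is my own minimal sketch of a causal (and optionally dilated) 1D convolution done with left padding only. It is plain Python for readability, and the names are made up; a real implementation would use a framework's convolution layer.

```python
def causal_conv1d(x, weights, dilation=1):
    """y[t] = sum_k weights[k] * x[t - k * dilation], with zeros padded on
    the left only, so y[t] never looks at inputs after time t.
    Note: y[t] does see x[t] itself; a strictly autoregressive first
    layer would additionally shift the input by one step."""
    pad = (len(weights) - 1) * dilation
    padded = [0.0] * pad + list(x)
    y = []
    for t in range(len(x)):
        y.append(sum(w * padded[pad + t - k * dilation]
                     for k, w in enumerate(weights)))
    return y

def receptive_field(kernel_size, dilations):
    """Number of input steps seen by one output after stacking layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

x = [1, 2, 3, 4, 5]
y_plain = causal_conv1d(x, weights=[1, 1], dilation=1)    # x[t] + x[t-1]
y_dilated = causal_conv1d(x, weights=[1, 1], dilation=2)  # x[t] + x[t-2]
```

With kernel size 2 and dilations 1, 2, 4, 8 the stack sees `receptive_field(2, [1, 2, 4, 8])` = 16 steps → the range doubles with every layer while the number of weights only grows linearly.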

Now we can extend this to → 2D data → very cool! → since images have a natural 2D structure → we need to take advantage of this. (convolutional networks → can be translation invariant → not really true).

Something happening at one location is treated the same as at another location → translation invariant → generalizes better. (we want this). (just like any other autoregressive model → we need to impose some ordering). Predicting pixel values → using the pixels that come before the current pixel in that ordering.

We need to build the ordering into the network architecture itself.

Mask it! → a convolution filter with a mask → LOL → this works well. (the dependency on future pixels is now broken). (the autoregressive ordering is imposed correctly).
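
A small sketch of what such a filter mask looks like, assuming raster-scan ordering (left to right, top to bottom) and a single channel. The "A"/"B" naming follows the PixelCNN paper as I remember it: type A hides the centre pixel (first layer), type B keeps it (later layers). The function name is my own.

```python
def pixelcnn_mask(kernel_size, mask_type):
    """Binary mask for a kernel_size x kernel_size filter.
    Keeps every row above the centre and the entries left of the centre;
    type "B" also keeps the centre itself, type "A" does not."""
    assert kernel_size % 2 == 1 and mask_type in ("A", "B")
    c = kernel_size // 2
    mask = [[0] * kernel_size for _ in range(kernel_size)]
    for i in range(kernel_size):
        for j in range(kernel_size):
            if i < c or (i == c and j < c):
                mask[i][j] = 1
    if mask_type == "B":
        mask[c][c] = 1
    return mask

mask_a = pixelcnn_mask(3, "A")
mask_b = pixelcnn_mask(3, "B")
for row in mask_a:
    print(row)
```

The mask is multiplied elementwise with the filter weights before every convolution → everything else about the conv layer stays the same.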

But what about the blind spots? → we want to compute the next layer. (the shadowed regions are a problem). (a certain part of the context is never seen by the model → so the conditionals might not use everything they should). (the model can be infinitely deep but → still does a bad job → how about rotation? → this is possible → since we can train the model again and again → with rotated images).
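
To convince myself that the blind spot does not go away with depth, here is a small check of which pixels a stack of masked 3×3 convolutions can reach. This is my own sketch, assuming raster-scan ordering and a mask that keeps the full row above plus the left neighbour and the centre of the current row; offsets are (row, column) relative to the pixel being predicted.

```python
def receptive_field(num_layers, kernel_size=3):
    """Offsets (dr, dc) reachable from the output pixel through
    `num_layers` stacked masked convolutions."""
    c = kernel_size // 2
    # Offsets kept by one masked filter: every row above the centre,
    # plus the centre row up to and including the centre column.
    steps = [(dr, dc)
             for dr in range(-c, 1)
             for dc in range(-c, c + 1)
             if dr < 0 or dc <= 0]
    reach = {(0, 0)}
    for _ in range(num_layers):
        reach = {(r + dr, col + dc) for (r, col) in reach for (dr, dc) in steps}
    return reach

deep = receptive_field(num_layers=10)
# One row up, two columns to the right: earlier in raster order, so it is
# valid context, but no stack depth ever reaches it.
blind = (-1, 2) not in deep
# Two rows up, two columns to the right is reachable.
seen = (-2, 2) in deep
```

Each layer can move right only while also moving up a row → the reachable region is a diagonal wedge, and the pixels to the right of that wedge are the blind spot.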

Why do we go through all of this trouble for autoregressive models → we want the conditional distributions. (this seems to be very useful). (it avoids → having to make all probabilities sum to one over the joint space). Why don’t we just do everything in 1D? → there is some structure within the image that we want to take advantage of. (we could still do this with a 1D convolution).

A fix for the blind spot → we have two streams of convolution → one stream is the horizontal stack and the other is the vertical stack → such a clever solution! We want to grow the receptive field in a vertical fashion → quite a challenge.

With clever padding, we can achieve this → super interesting. The vertical stack sees only the information above → we can do this by padding the convolution filter.

An improvement → is the gated ResNet → a better architecture → not just linear (additive) interactions, we are also adding multiplicative ones. (the gated method → another stream of convolution). This is pretty common → and gives better performance.
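
The multiplicative part can be sketched as a gated activation unit, which as far as I know is tanh of one convolution output times sigmoid of another. Here `a` and `b` stand for the outputs of those two convolution streams (hypothetical plain lists instead of feature maps).

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def gated_activation(a, b):
    """Elementwise tanh(a) * sigmoid(b): `a` carries the features,
    `b` decides how much of each feature is let through."""
    return [math.tanh(ai) * sigmoid(bi) for ai, bi in zip(a, b)]

features = [2.0, 2.0, -2.0]
gates = [10.0, -10.0, 0.0]
out = gated_activation(features, gates)
# gate wide open, gate almost closed, gate half open
```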

The score is measured in bits/dim → lower is better, and the bit rate here is very low. (cool).
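
For my own reference, converting a negative log-likelihood into bits/dim is just a change of units. This sketch assumes the NLL is given in nats for the whole data point (e.g. one image) and that `num_dims` counts every pixel channel.

```python
import math

def bits_per_dim(nll_nats, num_dims):
    """Convert a per-example negative log-likelihood in nats into
    bits per dimension: divide by ln(2) and by the number of dimensions."""
    return nll_nats / (num_dims * math.log(2))

# Sanity check: a uniform model over 256 values per dimension costs
# ln(256) nats per dimension, which is exactly 8 bits.
num_dims = 32 * 32 * 3  # e.g. a 32x32 RGB image (assumed size)
uniform_nll = num_dims * math.log(256)
uniform_bpd = bits_per_dim(uniform_nll, num_dims)
```

So anything below 8 bits/dim on 8-bit images means the model compresses better than storing the raw pixels.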

Another extension of the PixelCNN → this seems to go on and on. (autoregressive models are very interesting). (temporal dimension → very useful for time series prediction and more). (not all pixels are equally important → nearby pixels should be similar to one another). (nearby bins should occur together more often → PixelCNN++ → builds in stronger assumptions like this). (temperature → depends on the previous day’s temperature).
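
The "nearby bins" idea can be sketched with a discretized logistic distribution, which is my understanding of what PixelCNN++ uses in place of a 256-way softmax. This is a simplified single-component version on integer bins 0..255 (the paper uses a mixture and rescaled pixel values); `mu` and `scale` are hypothetical values that the network would normally predict.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def discretized_logistic_pmf(mu, scale, num_bins=256):
    """P(v) = CDF(v + 0.5) - CDF(v - 0.5) for a logistic with mean `mu`
    and scale `scale`; the edge bins absorb the tails so the
    probabilities sum to one."""
    probs = []
    for v in range(num_bins):
        upper = 1.0 if v == num_bins - 1 else sigmoid((v + 0.5 - mu) / scale)
        lower = 0.0 if v == 0 else sigmoid((v - 0.5 - mu) / scale)
        probs.append(upper - lower)
    return probs

probs = discretized_logistic_pmf(mu=128.0, scale=10.0)
```

Two numbers describe the whole distribution, and bins close to `mu` automatically get similar probabilities → unlike a softmax, which has to learn that bin 127 and bin 128 are related.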