For NLP generative models such as GPT, please check https://github.com/rmgogogo/nano-transformers.
This repo focuses more on image generative models, though GPT-style models are also tried here.
This repo uses PyTorch.
python vae.py --train --epochs 10 --predict
python cvae.py --train --epochs 10 --predict
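For reference, here is a minimal sketch of the core VAE training pieces, the reparameterization trick and the ELBO loss, plus the one-line change that turns it into a CVAE; the function names are illustrative, not the repo's actual API:

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    # z = mu + sigma * eps keeps sampling differentiable w.r.t. mu and logvar
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def vae_loss(x_hat, x, mu, logvar):
    # Reconstruction term + KL(q(z|x) || N(0, I))
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# CVAE: same loss; the only change is conditioning encoder and decoder on the
# label, e.g. z = torch.cat([z, F.one_hot(y, 10).float()], dim=1) before decoding.
```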
python diffusion.py --train --epochs 100 --predict
Training on a Mac Mini M1 takes around 1 hour (1:17:16).
python conditional_diffusion.py --train --epochs 100 --predict
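For orientation, a hedged sketch of one DDPM training step (noise prediction); the linear beta schedule and the `model(x_t, t, y)` signature are assumptions, not the scripts' exact code:

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def ddpm_loss(model, x0, y=None):
    # Pick a random timestep per sample, add the matching amount of noise,
    # and train the model to predict that noise
    t = torch.randint(0, T, (x0.size(0),), device=x0.device)
    eps = torch.randn_like(x0)
    a = alphas_cumprod.to(x0.device)[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps
    # Conditional diffusion additionally feeds the class label y
    eps_pred = model(x_t, t) if y is None else model(x_t, t, y)
    return F.mse_loss(eps_pred, eps)
```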
python clip.py --train --epochs 10 --predict
A pro version of CLIP: it pairs the nano image encoder with a real BERT text encoder (768-d vectors) and real text. But since we only have 10 digit classes, a batch is very likely to contain the same digit more than once, and then CLIP's contrastive loss can't work well: duplicates are treated as negatives even though they should match. Using a smaller batch would reduce duplicates, but small batches have their own problems, so the performance is not good. Still, it's good enough as a demo to convey the essence.
python clip_pro.py --train --epochs 10 --predict
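A minimal sketch of the symmetric contrastive loss CLIP uses, which also shows where the duplicate-digit problem bites: the targets assume pair i is the only positive for image i, so repeated digits land off the diagonal and get wrongly pushed apart (names are illustrative):

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    # Cosine-similarity logits between every image/text pair in the batch
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    # Targets assume pair i is the ONLY positive for image i. With just 10
    # digits, a batch usually repeats digits, so identical pairs end up
    # off-diagonal and are treated as "negatives".
    targets = torch.arange(img.size(0), device=img.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```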
python vqvae.py --train --epochs 100 --predict
The codebook size is 32; here we display all the possibilities. This sample vector-quantizes the whole z; in real systems, VQ is applied to parts of the latent (e.g., per spatial location).
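A hedged sketch of the vector-quantization step with the straight-through gradient estimator, quantizing the whole z against a 32-entry codebook as described above:

```python
import torch

def vector_quantize(z, codebook):
    # z: (B, D); codebook: (K, D), with K = 32 here
    dists = torch.cdist(z, codebook)            # (B, K) distances to all codes
    idx = dists.argmin(dim=1)                   # nearest code id per sample
    z_q = codebook[idx]                         # quantized latent
    # Straight-through estimator: forward uses z_q, gradient flows to z
    z_st = z + (z_q - z).detach()
    # Codebook loss + commitment loss, as in the VQ-VAE paper
    vq_loss = ((z_q - z.detach()) ** 2).mean() + 0.25 * ((z - z_q.detach()) ** 2).mean()
    return z_st, idx, vq_loss
```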
The initial codebook:
The learned codebook:
DDIM sampling is 50 times faster.
python diffusion.py --predict --ddim
python conditional_diffusion.py --predict --ddim
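For intuition, a minimal sketch of the deterministic (eta = 0) DDIM update: sampling visits only a short subsequence of the 1000 training timesteps, which is where the speed-up comes from (the schedule handling is an assumption):

```python
import torch

@torch.no_grad()
def ddim_sample(model, x_t, timesteps, alphas_cumprod):
    # timesteps: a short descending subsequence of the 1000 training steps,
    # e.g. list(range(999, -1, -50)) visits only 20 of them
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        eps = model(x_t, torch.full((x_t.size(0),), t, device=x_t.device))
        # Predict x0, then jump straight to t_prev (eta = 0: deterministic)
        x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x_t = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
    return x_t
```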
Based on the VAE with latent size 8, this does diffusion in the latent space. However, since the latent space is already Gaussian-like by construction and highly compressed (8 numbers), diffusion in the latent did not work as well as expected. It's mainly for demo purposes.
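A sketch of the wiring described above, assuming `vae.encode` returns (mu, logvar) and `denoiser` takes (z_t, t); both are stand-ins for the actual script:

```python
import torch
import torch.nn.functional as F

def latent_diffusion_loss(vae, denoiser, x0, alphas_cumprod):
    # Encode the image into the 8-d latent; the VAE stays frozen
    with torch.no_grad():
        mu, _ = vae.encode(x0)                           # (B, 8)
    t = torch.randint(0, len(alphas_cumprod), (mu.size(0),), device=mu.device)
    eps = torch.randn_like(mu)
    a = alphas_cumprod.to(mu.device)[t].unsqueeze(1)     # (B, 1)
    z_t = a.sqrt() * mu + (1 - a).sqrt() * eps
    # The latent is already close to N(0, I), so there is little structure
    # left to learn -- one reason this demo underperforms pixel-space diffusion
    return F.mse_loss(denoiser(z_t, t), eps)
```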
A GAN with a simple convolutional net, i.e., a DCGAN.
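A compact sketch of what makes it a DCGAN: transposed convolutions upsample the latent in the generator, strided convolutions downsample in the discriminator (channel sizes here are illustrative):

```python
import torch.nn as nn

# Generator: 64-d latent (as a 1x1 "image") -> 28x28 image via transposed convs
G = nn.Sequential(
    nn.ConvTranspose2d(64, 128, 7, 1, 0), nn.BatchNorm2d(128), nn.ReLU(),  # 1x1 -> 7x7
    nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),   # 7x7 -> 14x14
    nn.ConvTranspose2d(64, 1, 4, 2, 1), nn.Tanh(),                         # 14x14 -> 28x28
)

# Discriminator: 28x28 image -> single real/fake logit via strided convs
D = nn.Sequential(
    nn.Conv2d(1, 64, 4, 2, 1), nn.LeakyReLU(0.2),    # 28x28 -> 14x14
    nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),  # 14x14 -> 7x7
    nn.Conv2d(128, 1, 7, 1, 0),                      # 7x7 -> 1x1 logit
)
```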
Split the image into 4x4 patches, so a 28x28 image becomes a 7x7 grid of patches.
Train a VQ-VAE on the patches.
It acts like a tokenizer that gives each patch an identifier, so an image can be represented as a 7x7 token sequence (see the sketch below). Later we can implement a ViT based on it.
Comparing the patch-level VQ-VAE with the plain VQ-VAE or VAE, the images are sharper. However, at the boundaries between patches we may need some additional low-pass filtering to make them smoother.
The codebook is trained and looks good.
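A sketch of the patch-tokenization step described above, assuming a codebook of flattened 4x4 patch vectors; `unfold` cuts the 28x28 image into the 7x7 grid of patches:

```python
import torch

def tokenize_image(img, codebook):
    # img: (B, 1, 28, 28); codebook: (K, 16) flattened 4x4 patch vectors
    B = img.size(0)
    patches = img.unfold(2, 4, 4).unfold(3, 4, 4)    # (B, 1, 7, 7, 4, 4)
    patches = patches.reshape(B * 49, 16)            # one row per patch
    # Nearest codebook entry per patch -> token ids, read back as a 7x7 grid
    ids = torch.cdist(patches, codebook).argmin(dim=1)
    return ids.view(B, 7, 7)
```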
GPT-2 trained on a toy dataset (simple math).
python gpt2.py --train --epochs 400 --predict --input "1 + 1 ="
python llama.py --train --epochs 400 --predict --input "1 + 1 ="
python gemma.py --train --epochs 400 --predict --input "1 + 1 ="
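All three scripts boil down to the same objective; here is a minimal sketch of next-character prediction on toy math strings (the character-level tokenizer and `model` are stand-ins for the actual scripts):

```python
import torch
import torch.nn.functional as F

text = "1 + 1 = 2\n2 + 3 = 5\n"
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
ids = torch.tensor([stoi[ch] for ch in text])

def lm_loss(model, ids):
    # Shift by one: every position predicts the next character
    x, y = ids[:-1].unsqueeze(0), ids[1:].unsqueeze(0)
    logits = model(x)                                  # (1, T, vocab_size)
    return F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
```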
Diffusion Transformer (https://arxiv.org/pdf/2212.09748.pdf). Two ways to turn the image into tokens:

(1) Split the image into patches, vector-quantize each patch to tokenize the image into discrete tokens, and then get each token's vector via an embedding. Train a GPT to predict the tokens, which finally generates the image.

(2) Split the image into patches using a Conv layer to get the token vectors directly (see the sketch below).
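A sketch of option (2): a single Conv2d with kernel size = stride = patch size maps a 28x28 image directly to a 7x7 grid of token vectors, the standard ViT-style patch embedding:

```python
import torch
import torch.nn as nn

# kernel = stride = patch size: one 64-d vector per non-overlapping 4x4 patch
embed = nn.Conv2d(1, 64, kernel_size=4, stride=4)

img = torch.randn(8, 1, 28, 28)
tokens = embed(img)                          # (8, 64, 7, 7)
tokens = tokens.flatten(2).transpose(1, 2)   # (8, 49, 64): a 49-token sequence
```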