Programming Generative AI: From Variational Autoencoders to Stable Diffusion with PyTorch and Hugging Face [Released: 2/4/2025]
.MP4, AVC, 1280x720, 30 fps | English, AAC, 2 Ch | 18h 15m | 2.52 GB
Instructor: Jonathan Dinu
In this course, Jonathan Dinu, a dedicated educator, author, and speaker, presents an interactive tour of deep generative modeling. Learn how to train your own generative models from scratch to create an endless variety of images. Discover how to generate text with large language models similar to the ones that power applications like ChatGPT. Write your own text-to-image pipeline to understand how prompt-based generative models actually work. Plus, personalize large pretrained models like Stable Diffusion to generate images of novel subjects in unique visual styles. This course is an applied resource that complements any theoretical or conceptual knowledge you already have.
Learning objectives
- Train a variational autoencoder with PyTorch to learn a compressed latent space of images (see the VAE sketch after this list).
- Generate and edit realistic human faces with unconditional diffusion models and SDEdit.
- Use large language models such as GPT-2 to generate text with Hugging Face Transformers (sketched after this list).
- Perform text-based semantic image search using multimodal models such as CLIP (sketched after this list).
- Program your own text-to-image pipeline to understand how prompt-based generative models such as Stable Diffusion actually work (see the diffusers sketch after this list).
- Evaluate generative models, both qualitatively and quantitatively.
- Caption images using pretrained foundation models.
- Generate images in a specific visual style by efficiently fine-tuning Stable Diffusion with LoRA (see the LoRA note in the diffusers sketch after this list).
- Create personalized AI avatars by teaching pretrained diffusion models new subjects and concepts with DreamBooth.
- Guide the structure and composition of generated images using depth- and edge-conditioned ControlNets (see the ControlNet sketch after this list).
- Perform near real-time inference with SDXL Turbo for frame-based video-to-video translation (sketched after this list).
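To give a concrete flavor of the hands-on work, the sketches below illustrate several of the objectives above. First, a minimal variational autoencoder in PyTorch: the MLP architecture, latent size, and unweighted loss are illustrative assumptions, not the course's exact model, and inputs are assumed to be 28x28 grayscale images scaled to [0, 1].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """A minimal MLP variational autoencoder for 28x28 grayscale images."""

    def __init__(self, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)      # mean of q(z|x)
        self.to_logvar = nn.Linear(256, latent_dim)  # log-variance of q(z|x)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, 784)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

def vae_loss(recon_logits, x, mu, logvar):
    # Reconstruction term plus KL divergence to the standard normal prior
    recon = F.binary_cross_entropy_with_logits(
        recon_logits, x.flatten(1), reduction="sum"
    )
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```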
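Next, text generation with GPT-2 through the Hugging Face Transformers pipeline API; the prompt and sampling settings here are arbitrary choices.

```python
from transformers import pipeline

# Load GPT-2 and sample a continuation (weights download on first use).
generator = pipeline("text-generation", model="gpt2")
result = generator(
    "Generative models learn to",
    max_new_tokens=40,
    do_sample=True,
    temperature=0.9,
)
print(result[0]["generated_text"])
```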
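For semantic image search, CLIP embeds a text query and candidate images into a shared space and scores their similarity; the filenames below are hypothetical placeholders for your own image collection.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder filenames standing in for a real image collection.
images = [Image.open(p) for p in ["cat.jpg", "beach.jpg", "city.jpg"]]
query = "a photo of a sandy beach at sunset"

inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text holds the similarity of the query to each image.
best = outputs.logits_per_text.argmax(dim=-1).item()
print(f"Best match: image index {best}")
```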
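Prompt-based generation with the diffusers library looks roughly like the following; the checkpoint id is a common public one that may differ from the course's, and the commented-out LoRA path is a placeholder, not a real adapter.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# To apply a visual style learned with LoRA fine-tuning, load adapter
# weights ("my-style-lora" is a placeholder, not a real checkpoint):
# pipe.load_lora_weights("my-style-lora")

image = pipe("a watercolor painting of a lighthouse at dawn").images[0]
image.save("lighthouse.png")
```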
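To guide structure and composition, a Canny-edge ControlNet can be plugged into the same pipeline; "reference.png" is a placeholder input image, and the Canny thresholds are illustrative.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Extract Canny edges from a reference image to constrain composition.
ref = np.array(load_image("reference.png"))  # placeholder path
edges = cv2.Canny(ref, 100, 200)
edges = Image.fromarray(np.stack([edges] * 3, axis=-1))

image = pipe("a cozy cabin in a snowy forest", image=edges).images[0]
image.save("cabin.png")
```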
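Finally, SDXL Turbo's few-step sampling makes near real-time image-to-image translation feasible. This sketch processes a single placeholder frame; a real video-to-video pipeline would loop over frames.

```python
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

# One video frame as a placeholder; loop over frames for video-to-video.
frame = load_image("frame_0001.png").resize((512, 512))

# Turbo models are distilled for very few steps and no guidance;
# num_inference_steps * strength should be >= 1.
out = pipe(
    "a claymation scene",
    image=frame,
    num_inference_steps=2,
    strength=0.5,
    guidance_scale=0.0,
).images[0]
out.save("frame_0001_translated.png")
```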