Obviously, there's been a lot of hype about AI in our industry (and the world at large) over the last few years.
This page is a collection of my personal notes as I try to understand this space - read at your own peril 😭
I can only promise a few things:
I'll keep updating this page as I randomly come to understand different terms and concepts
What I write will be accurate to the best of my knowledge. I have worked in software engineering for over 10 years so I'm not a complete noob, but then again I have not specialized in this field either so there may be some inaccuracies. Think of this as a slightly technical and shorter version of Artificial Intelligence for Dummies - and feel free to correct me.
Let's get started 💪🏼
The guidance scale (often labelled "CFG scale" in image generation tools) is a parameter, typically from 1 to 20, which controls how closely the image generation AI model follows your text prompt. The higher the value, the more closely the output image follows your text input. Generally speaking, it's recommended to use 7 - 9, and increase up to 15 if the generated image strays too far from the prompt.
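As far as I understand it, this scale usually feeds into "classifier-free guidance": the model predicts the noise twice, once without the prompt and once with it, and the scale stretches the difference between the two. Here's a minimal sketch of that blend in plain Python. The function name and the tiny lists standing in for noise predictions are my own illustration, not any library's API.

```python
def apply_guidance(uncond, cond, guidance_scale):
    """Blend an unconditional and a prompt-conditioned noise prediction.

    Illustrative only: real models do this on large tensors, not short lists.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


# Made-up noise predictions for three "pixels".
uncond_pred = [0.10, 0.20, 0.30]
cond_pred = [0.20, 0.10, 0.30]

for scale in (1.0, 7.5, 15.0):
    # A scale of 1.0 returns the prompt-conditioned prediction;
    # larger values push further in the direction the prompt suggests.
    print(scale, apply_guidance(uncond_pred, cond_pred, scale))
```

Where the two predictions agree (the third value), the scale changes nothing. It only amplifies the parts where the prompt makes a difference.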
In image-generation models like Stable Diffusion, checkpoints are pre-trained versions of the underlying model. They are used as starting points for the model to generate images, and are the result of training on specific image datasets.
For example, training the base Stable Diffusion model on a set of fantasy art will produce a checkpoint which can be used to generate images in the same style.
ControlNet is a relatively recent (2023) architecture which lets users "steer" or control the output of a text-to-image model like Stable Diffusion by combining the typical text prompt/input with another image input.
The second image input guides the diffusion model when generating its output, together with the text input. For example, ControlNet can:
use a user's scribble or sketch to place objects in the image output
use a pose input to generate images of a man, woman, or child in the same pose
use a depth map to generate images of an interior space with different wood or marble materials
Diffusion (more specifically, latent diffusion) is the technique used by the popular text-to-image AI models like Stable Diffusion.
In this technique, the model starts with a randomly generated image which is pure "noise". The model then predicts how much of this is "noise" compared to its target output (eg an image of a dog) and subtracts that noise (ie denoising). After iteratively subtracting noise over a number of steps, the generated image should be similar to the expected target output.
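To make the loop above concrete, here's a toy sketch in plain Python. It is heavily simplified: the "image" is just a short list of numbers, and the noise prediction is an oracle that already knows the target, standing in for the trained neural network a real model would use. The function name `toy_denoise` and all the values are assumptions for illustration.

```python
import random


def toy_denoise(target, steps=20, seed=0):
    """Start from pure noise and iteratively subtract predicted noise."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # the random "noise image"

    for step in range(steps):
        # Stand-in for the model: an oracle that says how far x is from the target.
        predicted_noise = [xi - ti for xi, ti in zip(x, target)]
        # Remove a share of the remaining noise; the final step removes all of it.
        fraction = 1.0 / (steps - step)
        x = [xi - fraction * ni for xi, ni in zip(x, predicted_noise)]

    return x


target_image = [0.5, -1.0, 2.0]
print(toy_denoise(target_image))
```

A real diffusion model never sees the target. It has to infer the noise from the noisy image and the text prompt, which is why the result is only "similar to" the expected output.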
DreamBooth is a training technique to fine-tune any text-to-image model into generating images of a specific subject. By supplying a set of training images labelled both with an instance of a subject (eg "John") as well as the class of the subject (eg "person"), the underlying model learns to generate further images using the same subject.
Fine-tuning is the process of further training a base model like Stable Diffusion, by feeding it a new dataset of more images in a particular style or subject. This results in a Checkpoint which will mimic this newly fine-tuned style.
Common fine-tuning methods include:
Checkpoint training
LoRA
Named after the AI researcher Tero Karras, you'll usually see this term as a label/modifier for different samplers used in image generation models. Samplers using the Karras noise schedule use larger noise step sizes in initial sampling steps and smaller step sizes near the end, resulting in improved image quality.
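The "big steps first, small steps last" idea can be sketched with the schedule formula from the Karras et al. paper, as far as I understand it. The noise levels (sigmas) are spaced evenly in a warped space and then raised to a power `rho`. The function name and the `sigma_min` / `sigma_max` values below are illustrative assumptions, not the defaults of any particular tool.

```python
def karras_sigmas(n_steps, sigma_min=0.1, sigma_max=10.0, rho=7.0):
    """Return n_steps noise levels, from sigma_max down to sigma_min.

    Needs n_steps >= 2. Values are illustrative only.
    """
    min_inv = sigma_min ** (1.0 / rho)
    max_inv = sigma_max ** (1.0 / rho)
    sigmas = []
    for i in range(n_steps):
        t = i / (n_steps - 1)  # 0.0 at the first step, 1.0 at the last
        sigmas.append((max_inv + t * (min_inv - max_inv)) ** rho)
    return sigmas


sigmas = karras_sigmas(10)
steps = [a - b for a, b in zip(sigmas, sigmas[1:])]
print([round(s, 3) for s in sigmas])
print([round(s, 3) for s in steps])  # step sizes shrink towards the end
```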
LangChain is the leading open source framework for building workflows (ie "chaining" various steps) for LLMs. The framework uses LangChain Expression Language (LCEL) to specify intermediate steps for an LLM to take before returning output to the user, such as retrieving specific data from a database or translating into a target language.
LoRA (Low-Rank Adaptation) models are the result of fine-tuning a base model in a specific way: instead of updating all of the model's weights, only a small set of extra low-rank weights is trained and added on top of the base model, to better suit a specific/smaller training dataset.
LoRA training is a faster, more efficient method for customizing a base model compared to checkpoint training:
is typically much faster than full checkpoint training (8 minutes vs 20 minutes)
produces smaller output files (2 - 200 MB vs several gigabytes)
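The size difference makes more sense with a bit of arithmetic. My understanding is that LoRA leaves a big weight matrix frozen and learns two thin matrices whose product is added to it. The sketch below uses plain Python lists; the function names, the 768 x 768 layer size and the rank of 8 are assumptions picked for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_param_counts(d_out, d_in, rank):
    """Trainable values: full fine-tuning vs a LoRA update of the given rank."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return full, lora


def apply_lora(weight, lora_b, lora_a, scale=1.0):
    """Return weight + scale * (B @ A) without changing the original weight."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


full, lora = lora_param_counts(768, 768, 8)
print(full, lora, full / lora)  # the LoRA update is far smaller

base = [[1.0, 0.0], [0.0, 1.0]]
print(apply_lora(base, [[1.0], [2.0]], [[0.5, 0.0]]))
```

Only the two thin matrices need to be saved and shared, which is why the files are so much smaller than a full checkpoint.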
Retrieval-Augmented Generation is the process for improving the output of a generic LLM by asking it to reference a predetermined set of vetted data in any answers, rather than hallucinating and presenting incorrect information.
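Here's a toy sketch of the "retrieve, then ask" flow in plain Python. It picks the most relevant document by counting shared words and pastes it into the prompt. Real systems typically use embeddings and a vector database for the retrieval step, and then send the prompt to an actual LLM. The function names, the documents and the prompt wording are all made up for illustration.

```python
def retrieve(question, documents, top_k=1):
    """Rank documents by how many words they share with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, documents, top_k=1):
    """Put the retrieved text in front of the question for the LLM."""
    context = "\n".join(retrieve(question, documents, top_k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


vetted_docs = [
    "Refunds are accepted within 30 days of purchase.",
    "Our office is closed on public holidays.",
]
print(build_prompt("How many days do refunds take?", vetted_docs))
```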
Samplers are a necessary component in diffusion-based image generation models. They are responsible for each denoising iteration, which removes noise from an original random image towards the intended prompt output.
Each step / iteration for removing noise is referred to as a sampling step.
There are different samplers available, which use different algorithms to determine how to remove noise from an image. Some samplers are optimized for speed, while others are better at producing images that closely match the prompt.
Some common samplers in use are:
Euler - the simplest / fastest method, but not as accurate.
DDPM (Denoising Diffusion Probabilistic Models) - one of the first samplers, requires a large number of steps to achieve a decent result.
DDIM (Denoising Diffusion Implicit Models) - an improved DDPM resulting in a faster sampler with better quality.
Samplers can also have an ancestral version (eg Euler vs Euler A), which means that some randomness is added at each sampling step. This provides greater creativity but the image would not converge even with a large number of sampling steps.
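To show the difference between a plain and an ancestral step, here's a sketch on a single number instead of an image. It is modelled on my understanding of the commonly used Euler and Euler ancestral updates, so treat the exact formulas as an assumption. `denoised` stands in for the model's current guess of the clean image, and `sigma` / `sigma_next` are the noise levels before and after the step.

```python
import math
import random


def euler_step(x, denoised, sigma, sigma_next):
    """Deterministic step: move x towards the denoised guess."""
    d = (x - denoised) / sigma
    return x + d * (sigma_next - sigma)


def euler_ancestral_step(x, denoised, sigma, sigma_next, rng):
    """Ancestral step: denoise a bit further, then add fresh random noise."""
    sigma_up = min(
        sigma_next,
        math.sqrt(sigma_next**2 * (sigma**2 - sigma_next**2) / sigma**2),
    )
    sigma_down = math.sqrt(sigma_next**2 - sigma_up**2)
    d = (x - denoised) / sigma
    x = x + d * (sigma_down - sigma)
    return x + rng.gauss(0.0, 1.0) * sigma_up  # the extra randomness


# Same inputs always give the same result for the plain step...
print(euler_step(2.0, 1.0, 1.0, 0.5))
# ...but the ancestral step depends on the random draw.
print(euler_ancestral_step(2.0, 1.0, 1.0, 0.5, random.Random(1)))
print(euler_ancestral_step(2.0, 1.0, 1.0, 0.5, random.Random(2)))
```

Because new noise is injected at every step, changing the number of steps changes the random draws along the way, which is my reading of why ancestral samplers don't settle on one image.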
The VAE is a key component of many text-to-image AI models, and is responsible for:
encoding a larger image (eg 512 x 512 pixels) into a smaller "latent" representation of that image (eg 64 x 64 pixels)
decoding a smaller latent representation of that image back into the larger dimensions
Using this technique allows image-generation models to work in the latent space, which requires many times less memory than working directly on the original images.
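A quick back-of-the-envelope calculation shows roughly how much smaller the latent is. The 512 x 512 and 64 x 64 sizes come from the example above. The channel counts (3 for an RGB image, 4 for the latent) are my assumption based on Stable Diffusion v1, and other models may differ.

```python
def tensor_size(height, width, channels):
    """Number of values needed to store an image-like tensor."""
    return height * width * channels


image_values = tensor_size(512, 512, 3)  # RGB image
latent_values = tensor_size(64, 64, 4)   # assumed 4-channel latent
ratio = image_values / latent_values

print(image_values, latent_values, ratio)
```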
In general, a model checkpoint which is trained with a VAE will need to include the same VAE when generating new images. That is, during model training the VAE encodes images into a latent space, and the same VAE is needed when generating images to decode points from the latent space back into images.