Stable Diffusion
Mix.install([
  {:bumblebee, "~> 0.6.0"},
  {:nx, "~> 0.9.0"},
  {:exla, "~> 0.9.0"},
  {:kino, "~> 0.14.0"}
])
Nx.global_default_backend({EXLA.Backend, client: :host})
Introduction
Stable Diffusion is a latent text-to-image diffusion model, primarily used to generate images from a text prompt. Since its open-source release, the research, applications, and tooling around it have grown rapidly. Plenty of resources and examples are available online; here, let’s see how to run Stable Diffusion using Bumblebee!
> Note: Stable Diffusion is a very involved model, so the generation can take a long time if you run it on a CPU. Also, running on the GPU currently requires at least 5GiB of VRAM (or 3GiB with lower speed, see below).
Text to image
Stable Diffusion is composed of several separate models and preprocessors, so we will load all of them.
repo_id = "CompVis/stable-diffusion-v1-4"
opts = [params_variant: "fp16", type: :bf16, backend: EXLA.Backend]
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/clip-vit-large-patch14"})
{:ok, clip} = Bumblebee.load_model({:hf, repo_id, subdir: "text_encoder"}, opts)
{:ok, unet} = Bumblebee.load_model({:hf, repo_id, subdir: "unet"}, opts)
{:ok, vae} = Bumblebee.load_model({:hf, repo_id, subdir: "vae"}, [architecture: :decoder] ++ opts)
{:ok, scheduler} = Bumblebee.load_scheduler({:hf, repo_id, subdir: "scheduler"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, repo_id, subdir: "feature_extractor"})
{:ok, safety_checker} = Bumblebee.load_model({:hf, repo_id, subdir: "safety_checker"}, opts)
:ok
> Note: some checkpoints, such as runwayml/stable-diffusion-v1-5, require a license agreement. In those cases, sign up on Hugging Face, accept the license on the repository page, generate an access token in the settings, and add it to the repository specification via :auth_token. You can use Livebook secrets to pass the token securely.
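As a minimal sketch of the note above, the token can be read from a Livebook secret and placed in the repository tuple. The secret name `HF_TOKEN` is an assumption made for illustration; Livebook exposes a secret with that name as the `LB_HF_TOKEN` environment variable.

```elixir
# Assumes a Livebook secret named HF_TOKEN (hypothetical name) has been added
# to this notebook; Livebook makes it available with the LB_ prefix.
hf_token = System.fetch_env!("LB_HF_TOKEN")

repo_id = "runwayml/stable-diffusion-v1-5"

# The token goes into the repository specification next to :subdir.
unet_repo = {:hf, repo_id, subdir: "unet", auth_token: hf_token}

# The specification is then passed to the loader as usual, for example:
# {:ok, unet} = Bumblebee.load_model(unet_repo, opts)
```

The same `auth_token: hf_token` entry would need to be added to every repository tuple that points at the gated checkpoint.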
With all the models loaded, we can now configure a serving implementation of the text-to-image task.
serving =
  Bumblebee.Diffusion.StableDiffusion.text_to_image(clip, unet, vae, tokenizer, scheduler,
    num_steps: 20,
    num_images_per_prompt: 1,
    safety_checker: safety_checker,
    safety_checker_featurizer: featurizer,
    compile: [batch_size: 1, sequence_length: 60],
    # Option 1
    defn_options: [compiler: EXLA]
    # Option 2 (reduces GPU usage, but runs noticeably slower)
    # Also remove `backend: EXLA.Backend` from the loading options above
    # defn_options: [compiler: EXLA, lazy_transfers: :always]
  )
Kino.start_child({Nx.Serving, name: StableDiffusion, serving: serving})
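The lower-memory variant mentioned in the comments (Option 2) touches two places: the loading options and the `defn_options`. The sketch below spells out both changes as plain keyword lists; the variable names `low_memory_opts` and `low_memory_defn_options` are made up for illustration and do not appear in the notebook.

```elixir
# Loading options as used earlier in this notebook.
opts = [params_variant: "fp16", type: :bf16, backend: EXLA.Backend]

# Option 2, step 1: drop the explicit backend from the loading options,
# as the comment in the serving cell instructs.
low_memory_opts = Keyword.delete(opts, :backend)

# Option 2, step 2: enable lazy transfers in the defn options.
low_memory_defn_options = [compiler: EXLA, lazy_transfers: :always]
```

With this variant, the models would be loaded with `low_memory_opts` instead of `opts`, and the serving would receive `defn_options: low_memory_defn_options`. As noted earlier, this trades speed for lower GPU memory usage.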
prompt_input =
  Kino.Input.text("Prompt", default: "numbat, forest, high quality, detailed, digital art")
negative_prompt_input = Kino.Input.text("Negative Prompt", default: "darkness, rainy, foggy")
Kino.Layout.grid([prompt_input, negative_prompt_input])
We are ready to generate images!
prompt = Kino.Input.read(prompt_input)
negative_prompt = Kino.Input.read(negative_prompt_input)
output =
  Nx.Serving.batched_run(StableDiffusion, %{prompt: prompt, negative_prompt: negative_prompt})
for result <- output.results do
  Kino.Image.new(result.image)
end
|> Kino.Layout.grid(columns: 2)
To achieve better quality, you can increase the number of steps and images (the `num_steps` and `num_images_per_prompt` options of the serving).
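As a rough sketch of that adjustment, the serving options can be built as a keyword list and overridden with `Keyword.merge/2`. The values `40` and `2` are illustrative assumptions rather than tuned recommendations, and the names `base_opts` and `quality_opts` are hypothetical.

```elixir
# Serving options from this notebook that do not depend on loaded models.
base_opts = [
  num_steps: 20,
  num_images_per_prompt: 1,
  compile: [batch_size: 1, sequence_length: 60],
  defn_options: [compiler: EXLA]
]

# Override only the two quality-related options (illustrative values).
quality_opts = Keyword.merge(base_opts, num_steps: 40, num_images_per_prompt: 2)
```

To use them, pass `quality_opts ++ [safety_checker: safety_checker, safety_checker_featurizer: featurizer]` as the last argument to `Bumblebee.Diffusion.StableDiffusion.text_to_image/6` and start the serving again. Expect generation to take longer as both values grow.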