
Stable Diffusion

Mix.install([
  {:bumblebee, "~> 0.1.2"},
  {:nx, "~> 0.4.1"},
  {:exla, "~> 0.4.1"},
  {:kino, "~> 0.8.0"}
])

Nx.global_default_backend(EXLA.Backend)

Introduction

Stable Diffusion is a latent text-to-image diffusion model, primarily used to generate images from a text prompt. Since its open-source release, the research, applications, and tooling around it have exploded. You can find plenty of resources and examples online; meanwhile, let's see how to run Stable Diffusion using Bumblebee!

Text to image

Stable Diffusion is composed of several separate models and preprocessors, so we will load all of them.

repository_id = "CompVis/stable-diffusion-v1-4"

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/clip-vit-large-patch14"})

{:ok, clip} = Bumblebee.load_model({:hf, repository_id, subdir: "text_encoder"})

{:ok, unet} =
  Bumblebee.load_model({:hf, repository_id, subdir: "unet"},
    params_filename: "diffusion_pytorch_model.bin"
  )

{:ok, vae} =
  Bumblebee.load_model({:hf, repository_id, subdir: "vae"},
    architecture: :decoder,
    params_filename: "diffusion_pytorch_model.bin"
  )

{:ok, scheduler} = Bumblebee.load_scheduler({:hf, repository_id, subdir: "scheduler"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, repository_id, subdir: "feature_extractor"})
{:ok, safety_checker} = Bumblebee.load_model({:hf, repository_id, subdir: "safety_checker"})

:ok

> Note: some checkpoints, such as runwayml/stable-diffusion-v1-5, require a license agreement. In those cases, sign up on Hugging Face, accept the license on the repository page, generate an access token in the settings and add it to the repository specification via :auth_token. You can use Livebook secrets to pass the token securely.
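
As a sketch, loading a gated checkpoint could look like the following. The LB_HF_TOKEN environment variable is an assumption here: it corresponds to a Livebook secret named HF_TOKEN, and you should substitute whatever secret name you configured.

```elixir
# Hypothetical example: load a gated checkpoint with a Hugging Face access
# token stored as a Livebook secret named HF_TOKEN (exposed as LB_HF_TOKEN).
auth_token = System.fetch_env!("LB_HF_TOKEN")

{:ok, clip} =
  Bumblebee.load_model(
    {:hf, "runwayml/stable-diffusion-v1-5", subdir: "text_encoder", auth_token: auth_token}
  )
```

The same auth_token option can be passed in the repository tuple for every load_tokenizer, load_scheduler, and load_featurizer call.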

With all the models loaded, we can now configure a serving implementation of the text-to-image task.

serving =
  Bumblebee.Diffusion.StableDiffusion.text_to_image(clip, unet, vae, tokenizer, scheduler,
    num_steps: 20,
    num_images_per_prompt: 2,
    safety_checker: safety_checker,
    safety_checker_featurizer: featurizer,
    compile: [batch_size: 1, sequence_length: 60],
    defn_options: [compiler: EXLA]
  )

text_input =
  Kino.Input.text("Prompt", default: "numbat, forest, high quality, detailed, digital art")
negative_input = Kino.Input.text("Negative Prompt", default: "darkness, rainy, foggy")

We are ready to generate images!

prompt = Kino.Input.read(text_input)
negative_prompt = Kino.Input.read(negative_input)

output = Nx.Serving.run(serving, %{prompt: prompt, negative_prompt: negative_prompt})

for result <- output.results do
  Kino.Image.new(result.image)
end
|> Kino.Layout.grid(columns: 2)

> Note: Stable Diffusion is a computationally heavy model, so generation can take a long time when running on a CPU.

If you have a GPU available, feel free to increase the number of steps and images per prompt to achieve better quality.
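
For instance, the serving could be reconfigured with more diffusion steps and more images per prompt. The specific values below are only illustrative, not tuned recommendations.

```elixir
# With a GPU, more steps generally improve image quality at the cost of
# longer generation time. These values are illustrative only.
serving =
  Bumblebee.Diffusion.StableDiffusion.text_to_image(clip, unet, vae, tokenizer, scheduler,
    num_steps: 50,
    num_images_per_prompt: 4,
    safety_checker: safety_checker,
    safety_checker_featurizer: featurizer,
    compile: [batch_size: 1, sequence_length: 60],
    defn_options: [compiler: EXLA]
  )
```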