Adding out-of-the-box support for multilingual models like BAAI/AltDiffusion to diffusers #2135
```python
from diffusers import AltDiffusionPipeline, DPMSolverMultistepScheduler
import torch

pipe = AltDiffusionPipeline.from_pretrained("BAAI/AltDiffusion", torch_dtype=torch.float16, revision="fp16")
pipe = pipe.to("cuda")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

prompt = "黑暗精灵公主,非常详细,幻想,非常详细,数字绘画,概念艺术,敏锐的焦点,插图"
# or in English:
# prompt = "dark elf princess, highly detailed, d & d, fantasy, highly detailed, digital painting, trending on artstation, concept art, sharp focus, illustration, art by artgerm and greg rutkowski and fuji choko and viktoria gavrilenko and hoang lap"
image = pipe(prompt, num_inference_steps=25).images[0]
image.save("./alt.png")
```

If that didn't work, please feel free to describe a bit more what error was produced.
Would it be an option for you guys to add support for these models to the `StableDiffusionPipeline`? Almost all SD diffusers code I've seen uses the `StableDiffusionPipeline`. It's also more user friendly IMO to have one unified pipeline.
Hey @jslegers, Pipelines are not supposed to be "all-in-one" toolboxes, but rather examples of how to run a certain model. For more detailed information you can have a look at the docs here:
Pipelines are just examples? *confused*
If the only difference between Stable Diffusion & AltDiffusion is a different text encoder & tokenizer, they're pretty much the same thing from an end user perspective, as changing every Stable Diffusion model into an AltDiffusion model (or vice versa) is trivial and can be achieved with a few tweaks to JSON files & drag-and-dropping some files... Or am I missing something?
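For illustration, a rough, untested sketch of the kind of swap described above: grafting AltDiffusion's multilingual text encoder & tokenizer onto a Stable Diffusion checkpoint's UNet/VAE. This assumes both checkpoints are available locally or downloadable, and that the cross-attention dimensions match (both SD v1.x and AltDiffusion use 768):

```python
# Untested sketch: reuse a Stable Diffusion checkpoint's UNet/VAE with
# AltDiffusion's multilingual text encoder & tokenizer.
from diffusers import AltDiffusionPipeline, StableDiffusionPipeline

sd = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
alt = AltDiffusionPipeline.from_pretrained("BAAI/AltDiffusion")

pipe = AltDiffusionPipeline(
    vae=sd.vae,
    text_encoder=alt.text_encoder,  # multilingual AltCLIP text stack
    tokenizer=alt.tokenizer,
    unet=sd.unet,
    scheduler=sd.scheduler,
    safety_checker=None,
    feature_extractor=None,
    requires_safety_checker=False,
)
```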
I think you can use the generic `DiffusionPipeline`:

```python
from diffusers import DiffusionPipeline

# model_id can be either a Stable Diffusion or AltDiffusion model
pipe = DiffusionPipeline.from_pretrained(model_id)
```

However, I gotta say I do understand and resonate with your desire of having a super powerful all-in-one pipeline that loads multiple models and contains all the latest features (incl. getting rid of the 77 token limit and maybe more). However, I think maintaining such a pipeline - also making sure all features are inter-compatible with one another - may be beyond the scope of what the maintainers aim for with the library, as you can see in the pipeline philosophy docs @patrickvonplaten has shared above. However, if you (or anyone in the community!) would be interested in maintaining an all-in-one interoperable pipeline with as many features as possible, feel free to add it as a community pipeline - community pipelines are maintained by the community and loadable from the library. The community pipeline …
That's definitely acceptable for personal use... as in code I write myself for e.g. demos or prototypes. But it's not a suitable approach for when I create a custom diffusion model (customized through Dreambooth, LoRA and/or merging checkpoints) that I want others to use. When you release something to the broader public, you want it to be as simple / dummy-proof as possible. For those with little to no experience with AI, learning how to use Stable Diffusion can already be a steep learning curve. Having to learn to code in Python, use Google Colab & become familiar with the `diffusers` library only adds to that. With so much demo code & tutorials out there using the `StableDiffusionPipeline`, …
Fair point. Having worked in R&D on a high-end JavaScript GIS library with a similar philosophy in the past, for a former employer, I very much appreciate and respect that you want to avoid over-engineering this library and turning it into yet another convoluted monolith that tries to do everything poorly. I appreciate that you want to keep things simple and that you want to empower the user of the library rather than give too much power to the library itself.

A good compromise, in my very humble opinion, would be to optimize error reporting and inform the user when they're using the wrong pipeline and which pipeline they should be using instead. So, for example, if they're using a `StableDiffusionPipeline` with a model that requires a different pipeline, the error message should name the pipeline they should be using instead. This way, users at least know what they're doing wrong and how to fix it without having to waste lots of precious time wading through GitHub issues or posts on the Hugging Face community.

Basically, what I'm saying is that errors associated with trying to load the wrong kind of diffusion model into a particular pipeline should be dummy-proof enough for the error message to be all the documentation needed for the users to figure out how to fix their mistake.
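One hypothetical shape such a check could take (a sketch, not the actual diffusers implementation): compare the pipeline class recorded in the checkpoint's `model_index.json` against the class being instantiated, and name the expected class in the error. The helper name and path argument below are made up for illustration:

```python
import json

# Hypothetical helper, not part of diffusers: model_index.json records the
# pipeline class a checkpoint was saved with under "_class_name".
def check_pipeline_class(model_index_path, requested_cls):
    with open(model_index_path) as f:
        expected = json.load(f).get("_class_name")
    if expected and expected != requested_cls.__name__:
        raise ValueError(
            f"This checkpoint was saved for `{expected}`, but you are loading it "
            f"with `{requested_cls.__name__}`. Please use `{expected}` (or the "
            f"generic `DiffusionPipeline`, which picks the right class) instead."
        )
```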
I'll consider it if and when I have the time for it. Right now, creating my own custom Stable Diffusion models, testing my models, creating demo apps for them, marketing them, experimenting with ChatGPT and its API, and trying to come up with a business model that allows me to monetize my experience with all this shit is already more than a full-time job. I'm already biting off way more than I can chew and struggling to prioritize my ongoing activities, all while still learning to find my way in what was a completely unknown ecosystem to me just a few months ago.

Also, I'm not convinced adding an additional community pipeline alongside existing pipelines, that supports all of the models supported by every other pipeline, is the way to go. IMO this would result in nothing but unnecessary redundancy & bloat. Based on my understanding of the current `diffusers` architecture, …

Additionally, ...

But hey...
Very much agree with all of your 4 points here:
Very good point! We could/should think about a nice system here. Note that when loading with the generic `DiffusionPipeline`, the correct pipeline class is picked automatically from the checkpoint's config. We could nevertheless make sure better error messages are thrown when something like:

```python
from diffusers import StableDiffusionInpaintPipeline

pipeline = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
```

is run, which is probably not giving a good error message at the moment.
Models and pipelines are a bit on different levels, as models are sub-components of pipelines. See here for more explanation:
Yes, they abstract from the `DiffusionPipeline`, so they have to …
100% agree - we should probs change all docs to just always use `DiffusionPipeline`.
Think 1. and 4. are quite actionable, in case anybody wants to pick them up to open PRs.
Hi, new contributor here. I've investigated 1. a bit and found that if we have something like
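presumably the same snippet as in the comment above (reproduced here as an assumption):

```python
from diffusers import StableDiffusionInpaintPipeline

# Load a plain text-to-image checkpoint into the inpainting pipeline
pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
```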
this doesn't actually throw an error: it succeeds and gives us a `StableDiffusionInpaintPipeline`. However, if we try to generate an image from this pipeline via an inpainting call such as
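a hypothetical call along these lines (the image paths and prompt below are placeholders; any 512×512 base image and mask would do):

```python
from PIL import Image

# Placeholder inputs: any RGB base image plus a black-and-white mask
init_image = Image.open("input.png").convert("RGB").resize((512, 512))
mask_image = Image.open("mask.png").convert("L").resize((512, 512))

pipe = pipe.to("cuda")
image = pipe(prompt="a red sofa", image=init_image, mask_image=mask_image).images[0]
```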
we get an error on the last line:
which makes sense, because the inpainting pipeline expects the initial layer of the UNet to have channels for the mask and masked image as well as the latent base image, whereas the base stable diffusion model checkpoint's UNet only has the 4 latent channels. Perhaps I'm not familiar enough with the code, but it's unclear to me how we could generically catch errors of this sort from inside `from_pretrained`. As for the error message itself, perhaps adding something like "please verify that the model and pipeline are consistent" might be helpful? (I'm also working on 4. but haven't made enough progress to submit a PR.)
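For what it's worth, one concrete way such a check could look (a sketch, not the fix that was eventually merged): compare the UNet's configured input channels against what the inpainting pipeline needs (4 latent + 4 masked-image + 1 mask = 9). This continues from the `pipe` loaded above:

```python
# Sketch of a load-time sanity check; channel counts are the standard SD
# values (4 latent channels for text-to-image UNets, 9 for inpainting ones).
EXPECTED_INPAINT_IN_CHANNELS = 9

if pipe.unet.config.in_channels != EXPECTED_INPAINT_IN_CHANNELS:
    raise ValueError(
        f"The UNet has in_channels={pipe.unet.config.in_channels}, but the "
        f"inpainting pipeline expects {EXPECTED_INPAINT_IN_CHANNELS}. "
        "Please verify that the model and pipeline are consistent."
    )
```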
@dg845 could you open a new issue? Happy to answer there :-)
I have opened a new issue at #2799.
Since the merging of #2799 and #2809, is there anything else left to do here, or should this be closed @patrickvonplaten?
Good call, think we can close this one, no @jslegers?
Not just yet. I'm still struggling with the limit of 77 tokens, which is way too small for many of my prompts. Is there a way to combine the SDXL pipeline with a text encoder that accepts longer prompts?

I'd like to keep the image generation capabilities of whichever SDXL model I'm using, but add better prompt support. More in particular, I'm looking for a way to overcome the limit of 77 tokens without just ignoring most of the prompt (see my initial post). I vaguely remember trying to just swap text encoders in the past, with it producing errors.

See also #8977 (comment) for a comment where I just referenced this one.
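One common workaround (not an official diffusers feature): tokenize the full prompt without truncation, run the text encoder over 77-token windows, and concatenate the resulting embeddings before handing them to the pipeline via `prompt_embeds`. The sketch below is untested and uses the SD 1.5 pipeline for simplicity; SDXL would need the same treatment for both of its text encoders plus the pooled embeddings. The model id and prompt are placeholders, and the naive chunking ignores BOS/EOS alignment at window boundaries, which community pipelines handle more carefully:

```python
import torch
from diffusers import StableDiffusionPipeline

# Placeholder model; the same idea applies per text encoder for SDXL
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

window = pipe.tokenizer.model_max_length  # 77 for CLIP


def encode_long_prompt(text, total_len):
    """Pad `text` to `total_len` tokens, encode it in `window`-sized chunks,
    and concatenate the per-chunk embeddings along the sequence dimension."""
    ids = pipe.tokenizer(
        text,
        truncation=False,
        padding="max_length",
        max_length=total_len,
        return_tensors="pt",
    ).input_ids.to(pipe.device)
    chunks = [ids[:, i : i + window] for i in range(0, total_len, window)]
    with torch.no_grad():
        return torch.cat([pipe.text_encoder(chunk)[0] for chunk in chunks], dim=1)


prompt = "a very long prompt ..."  # placeholder for a prompt over 77 tokens

# Round the prompt length up to a multiple of the 77-token window so the
# positive and negative embeddings end up with matching sequence lengths
n_tokens = pipe.tokenizer(prompt, truncation=False, return_tensors="pt").input_ids.shape[1]
total_len = ((n_tokens + window - 1) // window) * window

prompt_embeds = encode_long_prompt(prompt, total_len)
negative_embeds = encode_long_prompt("", total_len)

image = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    num_inference_steps=25,
).images[0]
image.save("long_prompt.png")
```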
Ah, nevermind, I just saw this. I'm going to give that approach a try. I'll close this ticket myself. |
Description of the problem

I'm unable to load multilingual models like BAAI/AltDiffusion with `diffusers`.

The solution I'd like

I would like `diffusers` to have out-of-the-box support for models like BAAI/AltDiffusion, that use AltCLIP or a different multilingual CLIP model.

Alternatives I've considered

I tried loading BAAI/AltDiffusion, but it told me that I was missing a library. After installing that library, it produced a different error. I kinda gave up after that.

Additional context

Since AltCLIP has a `max_position_embeddings` value of 514 for its text encoder instead of 77, I had hoped I could just replace the text encoder and tokenizer of my models with those of BAAI/AltDiffusion to overcome the 77 token limit.