Overcoming the 77 token limit in diffusers #2136

jslegers · 2023-01-27T04:38:41Z

Description of the problem

CLIP has a 77 token limit, which is much too small for many prompts.

Several GUIs have found a way to overcome this limit, but not the diffusers library.

The solution I'd like

I would like diffusers to be able to run longer prompts and overcome the 77 token limit of CLIP for any model, much like the AUTOMATIC1111/stable-diffusion-webui already does.

Alternatives I've considered

I tried reverse-engineering the prompt interpretation logic from one of the other GUIs out there (not sure which one), but I couldn't find the code responsible.
I tried running the BAAI/AltDiffusion in diffusers, which uses AltCLIP instead of CLIP. Since AltCLIP has a max_position_embeddings value of 514 for its text encoder instead of 77, I had hoped I could just replace the text encoder and tokenizer of my models with those of BAAI/AltDiffusion to overcome the 77 token limit, but I couldn't get the BAAI/AltDiffusion to work in diffusers

Additional context

This is how the AUTOMATIC1111 overcomes the token limit, according to their documentation :

Typing past standard 75 tokens that Stable Diffusion usually accepts increases prompt size limit from 75 to 150. Typing past that increases prompt size further. This is done by breaking the prompt into chunks of 75 tokens, processing each independently using CLIP's Transformers neural network, and then concatenating the result before feeding into the next component of stable diffusion, the Unet.

For example, a prompt with 120 tokens would be separated into two chunks: first with 75 tokens, second with 45. Both would be padded to 75 tokens and extended with start/end tokens to 77. After passing those two chunks though CLIP, we'll have two tensors with shape of (1, 77, 768). Concatenating those results in (1, 154, 768) tensor that is then passed to Unet without issue.

The text was updated successfully, but these errors were encountered:

apolinario · 2023-01-27T15:15:48Z

Hey @jslegers, the Long Prompt Weighting Stable Diffusion community pipeline gets rid of the 77 token limit. You can check it out here

jslegers · 2023-01-27T16:38:26Z

@apolinario :

I have the same question / remark I made @ #2135.

Most people aren't going to figure out on their own that there is a dedicated pipeline to get rid of the 77 token limit. I sure wasn't able to find this info until you provided me a link... and I'm a dev with more than a decade of experience.

It's also not exactly user friendly to have a dedicated pipeline for what's a pretty important feature almost every Stable Diffusion user is likely to want (since it doesn't take much to surpass 77 tokens).

So why not just bake support for +77 tokens into StableDiffusionPipeline?

patrickvonplaten · 2023-01-31T08:50:41Z

Hey @jslegers,

It's true that our documentation is currently lacking behind a bit. Would you be interested in contributing a doc page about long prompting?

Also note that I would suggest to just use StableDiffusionPipeline and pass the prompt_embeds manually, e.g. the following code snippet works:

from diffusers import StableDiffusionPipeline
import torch

# 1. load model
model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# 2. Forward embeddings and negative embeddings through text encoder
prompt = 25 * "a photo of an astronaut riding a horse on mars"
max_length = pipe.tokenizer.model_max_length

input_ids = pipe.tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to("cuda")

negative_ids = pipe.tokenizer("", truncation=False, padding="max_length", max_length=input_ids.shape[-1], return_tensors="pt").input_ids                                                                                                     
negative_ids = negative_ids.to("cuda")

concat_embeds = []
neg_embeds = []
for i in range(0, input_ids.shape[-1], max_length):
    concat_embeds.append(pipe.text_encoder(input_ids[:, i: i + max_length])[0])
    neg_embeds.append(pipe.text_encoder(negative_ids[:, i: i + max_length])[0])

prompt_embeds = torch.cat(concat_embeds, dim=1)
negative_prompt_embeds = torch.cat(neg_embeds, dim=1)

# 3. Forward
image = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_prompt_embeds).images[0]
image.save("astronaut_rides_horse.png")

Could you try out whether this fits your use case? Would you be interested in adding a doc page about long-prompting maybe under: https://github.com/huggingface/diffusers/tree/main/docs/source/en/using-diffusers

jslegers · 2023-01-31T09:16:50Z

Also note that I would suggest to just use StableDiffusionPipeline and pass the prompt_embeds manually, e.g. the following code snippet works:

[...]

Could you try out whether this fits your use case?

It's an interesting approach and definitely more in line with what I'm looking for...

I'll need to try this on my demos and test scripts before I can comment on it further, but it looks promising as an approach for at least personal use...

I'd still argue this is a bit convoluted for something that Stable Diffusion should support out of the box, but I guess that's something RunwayML and StablilityAI should fix (by replacing CLIP with an alternative that supports more tokens) and not something the diffusers library is responsible for.

Would you be interested in adding a doc page about long-prompting maybe under: https://github.com/huggingface/diffusers/tree/main/docs/source/en/using-diffusers

I'll take that into consideration, under the condition I'm allowed to post that same content on my own blog(s) as well.

I was planning to do some tutorials on how to use Stable Diffusion anyway, so I might as well make some of that content official documentation.

patrickvonplaten · 2023-01-31T13:52:07Z

Feel free to use every content of diffusers in whatever way you like :-) It's MIT licensed

jslegers · 2023-02-01T21:46:33Z

@patrickvonplaten

Feel free to use every content of diffusers in whatever way you like :-) It's MIT licensed

Good to know...

Wasn't sure that license applied to documentation as well.

I'm not a lawyer, and I prefer to make as little assumptions as possible when it involves legal matters...

github-actions · 2023-02-26T15:03:21Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

romanfurman6 · 2023-04-11T09:20:42Z

Hey @jslegers,

It's true that our documentation is currently lacking behind a bit. Would you be interested in contributing a doc page about long prompting?

Also note that I would suggest to just use StableDiffusionPipeline and pass the prompt_embeds manually, e.g. the following code snippet works:

from diffusers import StableDiffusionPipeline
import torch

# 1. load model
model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# 2. Forward embeddings and negative embeddings through text encoder
prompt = 25 * "a photo of an astronaut riding a horse on mars"
max_length = pipe.tokenizer.model_max_length

input_ids = pipe.tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to("cuda")

negative_ids = pipe.tokenizer("", truncation=False, padding="max_length", max_length=input_ids.shape[-1], return_tensors="pt").input_ids                                                                                                     
negative_ids = negative_ids.to("cuda")

concat_embeds = []
neg_embeds = []
for i in range(0, input_ids.shape[-1], max_length):
    concat_embeds.append(pipe.text_encoder(input_ids[:, i: i + max_length])[0])
    neg_embeds.append(pipe.text_encoder(negative_ids[:, i: i + max_length])[0])

prompt_embeds = torch.cat(concat_embeds, dim=1)
negative_prompt_embeds = torch.cat(neg_embeds, dim=1)

# 3. Forward
image = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_prompt_embeds).images[0]
image.save("astronaut_rides_horse.png")

Could you try out whether this fits your use case? Would you be interested in adding a doc page about long-prompting maybe under: https://github.com/huggingface/diffusers/tree/main/docs/source/en/using-diffusers

Hey, with this example, getting such error: "Token indices sequence length is longer than the specified maximum sequence length for this model (X > 77). Running this sequence through the model will result in indexing errors"
is that okay?

patrickvonplaten · 2023-04-12T12:15:54Z

Hey @romanfurman6,

Could you please open a new issue for it? :-)

andrevanzuydam · 2023-04-19T08:28:28Z

Hey @jslegers,

It's true that our documentation is currently lacking behind a bit. Would you be interested in contributing a doc page about long prompting?

Also note that I would suggest to just use StableDiffusionPipeline and pass the prompt_embeds manually, e.g. the following code snippet works:

from diffusers import StableDiffusionPipeline
import torch

# 1. load model
model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# 2. Forward embeddings and negative embeddings through text encoder
prompt = 25 * "a photo of an astronaut riding a horse on mars"
max_length = pipe.tokenizer.model_max_length

input_ids = pipe.tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to("cuda")

negative_ids = pipe.tokenizer("", truncation=False, padding="max_length", max_length=input_ids.shape[-1], return_tensors="pt").input_ids                                                                                                     
negative_ids = negative_ids.to("cuda")

concat_embeds = []
neg_embeds = []
for i in range(0, input_ids.shape[-1], max_length):
    concat_embeds.append(pipe.text_encoder(input_ids[:, i: i + max_length])[0])
    neg_embeds.append(pipe.text_encoder(negative_ids[:, i: i + max_length])[0])

prompt_embeds = torch.cat(concat_embeds, dim=1)
negative_prompt_embeds = torch.cat(neg_embeds, dim=1)

# 3. Forward
image = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_prompt_embeds).images[0]
image.save("astronaut_rides_horse.png")

Could you try out whether this fits your use case? Would you be interested in adding a doc page about long-prompting maybe under: https://github.com/huggingface/diffusers/tree/main/docs/source/en/using-diffusers

I'm testing your code sample as I haven't been able to get the custom pipeline lpw_stable_diffusion to work on all computers I am testing on, I'm happy to document my findings. Thanks for the code.

andrevanzuydam · 2023-04-19T15:49:04Z

Just for in case someone comes accross this issue and wants a solution, I built something that works for both prompts correctly of varying lengths

from diffusers import StableDiffusionPipeline
import torch
import random

# 1. load model
model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")


pipe.enable_sequential_cpu_offload() # my graphics card VRAM is very low


def get_pipeline_embeds(pipeline, prompt, negative_prompt, device):
    """ Get pipeline embeds for prompts bigger than the maxlength of the pipe
    :param pipeline:
    :param prompt:
    :param negative_prompt:
    :param device:
    :return:
    """
    max_length = pipeline.tokenizer.model_max_length

    # simple way to determine length of tokens
    count_prompt = len(prompt.split(" "))
    count_negative_prompt = len(negative_prompt.split(" "))

    # create the tensor based on which prompt is longer
    if count_prompt >= count_negative_prompt:
        input_ids = pipeline.tokenizer(prompt, return_tensors="pt", truncation=False).input_ids.to(device)
        shape_max_length = input_ids.shape[-1]
        negative_ids = pipeline.tokenizer(negative_prompt, truncation=False, padding="max_length",
                                          max_length=shape_max_length, return_tensors="pt").input_ids.to(device)

    else:
        negative_ids = pipeline.tokenizer(negative_prompt, return_tensors="pt", truncation=False).input_ids.to(device)
        shape_max_length = negative_ids.shape[-1]
        input_ids = pipeline.tokenizer(prompt, return_tensors="pt", truncation=False, padding="max_length",
                                       max_length=shape_max_length).input_ids.to(device)

    concat_embeds = []
    neg_embeds = []
    for i in range(0, shape_max_length, max_length):
        concat_embeds.append(pipeline.text_encoder(input_ids[:, i: i + max_length])[0])
        neg_embeds.append(pipeline.text_encoder(negative_ids[:, i: i + max_length])[0])

    return torch.cat(concat_embeds, dim=1), torch.cat(neg_embeds, dim=1)


prompt = (22 + random.randint(1, 10)) * "a photo of an astronaut riding a horse on mars"
negative_prompt = (22 + random.randint(1, 10)) * "some negative texts"

print("Our inputs ", prompt, negative_prompt, len(prompt.split(" ")), len(negative_prompt.split(" ")))

prompt_embeds, negative_prompt_embeds = get_pipeline_embeds(pipe, prompt, negative_prompt, "cuda")

image = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_prompt_embeds).images[0]

image.save("done.png")

andrevanzuydam · 2023-04-20T10:08:25Z

The code above does weird things with commas and special chars, just an FYI, not sure if I need to regex the prompts to sanitize, probably lends to word prioritization etc

djj0s3 · 2023-06-02T20:51:07Z

Bumping this thread back up. The Long Prompt Weighting Stable Diffusion is great but it doesn't mix and match well (correct me if I'm wrong) when using other default pipelines, like ControlNet for example. I believe the spirit of Diffusers is like legos for working with diffusion models. But relying on a community pipeline for this workaround breaks that pattern a bit. I'd love some help on a standard way to add in the best parts of the expanded long prompt weighting pipeline without having to solely use that pipeline.

patrickvonplaten · 2023-06-05T10:05:15Z

Hey @djj0s3,

Yes good point. Can you try whether you can solve the same use case you had by using: https://huggingface.co/docs/diffusers/main/en/using-diffusers/weighted_prompts

djj0s3 · 2023-06-06T19:26:58Z

Thanks! That worked after I ran into some "I'm an idiot" errors. For anyone else that lands on here, read the Compel docs carefully - particularly this bit if you're using long tokens or you will get into mismatching tensor issues and be very sad.

compel = Compel(..., truncate_long_prompts=False)
prompt = "a cat playing with a ball++ in the forest, amazing, exquisite, stunning, masterpiece, skilled, powerful, incredible, amazing, trending on gregstation, greg, greggy, greggs greggson, greggy mcgregface, ..." # very long prompt
conditioning = compel.build_conditioning_tensor(prompt)
negative_prompt = "" # it's necessary to create an empty prompt - it can also be very long, if you want
negative_conditioning = compel.build_conditioning_tensor(negative_prompt)
[conditioning, negative_conditioning] = compel.pad_conditioning_tensors_to_same_length([conditioning, negative_conditioning])

DevonPeroutky · 2023-06-28T01:45:23Z

@patrickvonplaten Potentially stupid question, but if you directly pass the [neg]/prompt_embeddings into the pipeline, does that mean there's no longer an attention mask being used?

If so, could this cause issues with padding tokens (necessary to make the prompt and negative_prompt the same length), as they would not be ignored?

Thank you, and your team, for all the hard work btw.

hckj588ku · 2023-07-19T10:15:05Z

Hey @jslegers,

It's true that our documentation is currently lacking behind a bit. Would you be interested in contributing a doc page about long prompting?

Also note that I would suggest to just use StableDiffusionPipeline and pass the prompt_embeds manually, e.g. the following code snippet works:

from diffusers import StableDiffusionPipeline
import torch

# 1. load model
model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# 2. Forward embeddings and negative embeddings through text encoder
prompt = 25 * "a photo of an astronaut riding a horse on mars"
max_length = pipe.tokenizer.model_max_length

input_ids = pipe.tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to("cuda")

negative_ids = pipe.tokenizer("", truncation=False, padding="max_length", max_length=input_ids.shape[-1], return_tensors="pt").input_ids                                                                                                     
negative_ids = negative_ids.to("cuda")

concat_embeds = []
neg_embeds = []
for i in range(0, input_ids.shape[-1], max_length):
    concat_embeds.append(pipe.text_encoder(input_ids[:, i: i + max_length])[0])
    neg_embeds.append(pipe.text_encoder(negative_ids[:, i: i + max_length])[0])

prompt_embeds = torch.cat(concat_embeds, dim=1)
negative_prompt_embeds = torch.cat(neg_embeds, dim=1)

# 3. Forward
image = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_prompt_embeds).images[0]
image.save("astronaut_rides_horse.png")

Could you try out whether this fits your use case? Would you be interested in adding a doc page about long-prompting maybe under: https://github.com/huggingface/diffusers/tree/main/docs/source/en/using-diffusers

How can I use this by StableDIffusionXlPipeline

ManjuVajra · 2023-07-25T15:52:53Z

This is still an issue with diffusers use of CLIP in general. No feedback on if that code snippet currently works. I'll test it myself. Either way, it is still an issue.

patrickvonplaten · 2023-08-02T18:01:16Z

I don't think this is an issue. If you want to overcome the 77 tokens limit, I highly recommend using the compel library: https://github.com/damian0815/compel#compel

Atlas3DSS · 2023-11-15T07:38:27Z

Thanks! That worked after I ran into some "I'm an idiot" errors. For anyone else that lands on here, read the Compel docs carefully - particularly this bit if you're using long tokens or you will get into mismatching tensor issues and be very sad.

compel = Compel(..., truncate_long_prompts=False)
prompt = "a cat playing with a ball++ in the forest, amazing, exquisite, stunning, masterpiece, skilled, powerful, incredible, amazing, trending on gregstation, greg, greggy, greggs greggson, greggy mcgregface, ..." # very long prompt
conditioning = compel.build_conditioning_tensor(prompt)
negative_prompt = "" # it's necessary to create an empty prompt - it can also be very long, if you want
negative_conditioning = compel.build_conditioning_tensor(negative_prompt)
[conditioning, negative_conditioning] = compel.pad_conditioning_tensors_to_same_length([conditioning, negative_conditioning])

My brother in christ i thank you for this! I was struggling and you saved me ;D

o5faruk · 2023-11-28T13:50:17Z

self.txt2img_pipe.load_textual_inversion(
        EMBEDDING_PATHS, token=EMBEDDING_TOKENS, local_files_only=True
)

textual_inversion_manager = DiffusersTextualInversionManager(self.txt2img_pipe)


self.compel_proc = Compel(
    tokenizer=self.txt2img_pipe.tokenizer,
    text_encoder=self.txt2img_pipe.text_encoder,
    textual_inversion_manager=textual_inversion_manager,
    truncate_long_prompts=False,
)
if prompt:
    conditioning = self.compel_proc.build_conditioning_tensor(prompt)
    if not negative_prompt:
        negative_prompt = ""  # it's necessary to create an empty prompt - it can also be very long, if you want
    negative_conditioning = self.compel_proc.build_conditioning_tensor(
        negative_prompt
    )
    [
        prompt_embeds,
        negative_prompt_embeds,
    ] = self.compel_proc.pad_conditioning_tensors_to_same_length(
        [conditioning, negative_conditioning]
    )
    ...
    output = pipe(
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_prompt_embeds,
        guidance_scale=guidance_scale,
        generator=generator,
        num_inference_steps=num_inference_steps,
        **extra_kwargs,
    )

Im having weird issues, all the relevant code is shown above, however, negative_prompt messes up my image results, almost as if negatives are getting mixed up with positives.
Also, this happens only if prompt and negative prompt length exceeds 77 tokens.
extra_kwargs does not contain prompt or negative_prompt so only embeds are passed into pipeline. The pipeline in this case is controlnet text to image

Is it possible that negatives get mixed up into positives in pad_conditioning_tensors_to_same_length function?

This is my image with long negative prompt

And this is same seed, same prompt, no negative

lusp75 · 2024-01-13T15:35:20Z

get_pipeline_embeds

Which file should I modify? a py. file? Is it a file to add? Thank you!

andrevanzuydam · 2024-01-15T09:40:10Z

get_pipeline_embeds
Which file should I modify? a py. file? Is it a file to add? Thank you!

Hi @lusp75 if you look at my example way above, I just defined a method and used it in my code, dangerous to hack maintained libraries.

HamenderSingh · 2024-02-19T14:41:59Z

Just for in case someone comes accross this issue and wants a solution, I built something that works for both prompts correctly of varying lengths

from diffusers import StableDiffusionPipeline
import torch
import random

# 1. load model
model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")


pipe.enable_sequential_cpu_offload() # my graphics card VRAM is very low


def get_pipeline_embeds(pipeline, prompt, negative_prompt, device):
    """ Get pipeline embeds for prompts bigger than the maxlength of the pipe
    :param pipeline:
    :param prompt:
    :param negative_prompt:
    :param device:
    :return:
    """
    max_length = pipeline.tokenizer.model_max_length

    # simple way to determine length of tokens
    count_prompt = len(prompt.split(" "))
    count_negative_prompt = len(negative_prompt.split(" "))

    # create the tensor based on which prompt is longer
    if count_prompt >= count_negative_prompt:
        input_ids = pipeline.tokenizer(prompt, return_tensors="pt", truncation=False).input_ids.to(device)
        shape_max_length = input_ids.shape[-1]
        negative_ids = pipeline.tokenizer(negative_prompt, truncation=False, padding="max_length",
                                          max_length=shape_max_length, return_tensors="pt").input_ids.to(device)

    else:
        negative_ids = pipeline.tokenizer(negative_prompt, return_tensors="pt", truncation=False).input_ids.to(device)
        shape_max_length = negative_ids.shape[-1]
        input_ids = pipeline.tokenizer(prompt, return_tensors="pt", truncation=False, padding="max_length",
                                       max_length=shape_max_length).input_ids.to(device)

    concat_embeds = []
    neg_embeds = []
    for i in range(0, shape_max_length, max_length):
        concat_embeds.append(pipeline.text_encoder(input_ids[:, i: i + max_length])[0])
        neg_embeds.append(pipeline.text_encoder(negative_ids[:, i: i + max_length])[0])

    return torch.cat(concat_embeds, dim=1), torch.cat(neg_embeds, dim=1)


prompt = (22 + random.randint(1, 10)) * "a photo of an astronaut riding a horse on mars"
negative_prompt = (22 + random.randint(1, 10)) * "some negative texts"

print("Our inputs ", prompt, negative_prompt, len(prompt.split(" ")), len(negative_prompt.split(" ")))

prompt_embeds, negative_prompt_embeds = get_pipeline_embeds(pipe, prompt, negative_prompt, "cuda")

image = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_prompt_embeds).images[0]

image.save("done.png")

There is a bug causing error when prompt is bigger compared to negative prompt.
I've fixed it below.

def get_pipeline_embeds(pipeline, prompt, negative_prompt, device):
    """ Get pipeline embeds for prompts bigger than the maxlength of the pipe
    :param pipeline:
    :param prompt:
    :param negative_prompt:
    :param device:
    :return:
    """
    max_length = pipeline.tokenizer.model_max_length


    # simple way to determine length of tokens
    input_ids = pipeline.tokenizer(prompt, return_tensors="pt", truncation=False).input_ids.to(device)
    negative_ids = pipeline.tokenizer(negative_prompt, return_tensors="pt", truncation=False).input_ids.to(device)

    # create the tensor based on which prompt is longer
    if input_ids.shape[-1] >= negative_ids.shape[-1]:
        shape_max_length = input_ids.shape[-1]
        negative_ids = pipeline.tokenizer(negative_prompt, truncation=False, padding="max_length",
                                          max_length=shape_max_length, return_tensors="pt").input_ids.to(device)

    else:
        shape_max_length = negative_ids.shape[-1]
        input_ids = pipeline.tokenizer(prompt, return_tensors="pt", truncation=False, padding="max_length",
                                       max_length=shape_max_length).input_ids.to(device)

    concat_embeds = []
    neg_embeds = []
    for i in range(0, shape_max_length, max_length):
        concat_embeds.append(pipeline.text_encoder(input_ids[:, i: i + max_length])[0])
        neg_embeds.append(pipeline.text_encoder(negative_ids[:, i: i + max_length])[0])

    return torch.cat(concat_embeds, dim=1), torch.cat(neg_embeds, dim=1)

The original prompt embeddings will cause the pipeline to crash in the case the negative prompt was longer than the prompt. Implemented a fix suggested in huggingface/diffusers#2136.

zhentingqi · 2024-03-17T17:32:52Z

Hi, so is there any other solution to this problem: How can we prompt diffusion models with more than 77 tokens?

I see the following code snippet but it seems that it just split the text into chunks and encode the chunks one by one, instead of encoding the entire text sequence?

Just for in case someone comes accross this issue and wants a solution, I built something that works for both prompts correctly of varying lengths

from diffusers import StableDiffusionPipeline
import torch
import random

# 1. load model
model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")


pipe.enable_sequential_cpu_offload() # my graphics card VRAM is very low


def get_pipeline_embeds(pipeline, prompt, negative_prompt, device):
    """ Get pipeline embeds for prompts bigger than the maxlength of the pipe
    :param pipeline:
    :param prompt:
    :param negative_prompt:
    :param device:
    :return:
    """
    max_length = pipeline.tokenizer.model_max_length

    # simple way to determine length of tokens
    count_prompt = len(prompt.split(" "))
    count_negative_prompt = len(negative_prompt.split(" "))

    # create the tensor based on which prompt is longer
    if count_prompt >= count_negative_prompt:
        input_ids = pipeline.tokenizer(prompt, return_tensors="pt", truncation=False).input_ids.to(device)
        shape_max_length = input_ids.shape[-1]
        negative_ids = pipeline.tokenizer(negative_prompt, truncation=False, padding="max_length",
                                          max_length=shape_max_length, return_tensors="pt").input_ids.to(device)

    else:
        negative_ids = pipeline.tokenizer(negative_prompt, return_tensors="pt", truncation=False).input_ids.to(device)
        shape_max_length = negative_ids.shape[-1]
        input_ids = pipeline.tokenizer(prompt, return_tensors="pt", truncation=False, padding="max_length",
                                       max_length=shape_max_length).input_ids.to(device)

    concat_embeds = []
    neg_embeds = []
    for i in range(0, shape_max_length, max_length):
        concat_embeds.append(pipeline.text_encoder(input_ids[:, i: i + max_length])[0])
        neg_embeds.append(pipeline.text_encoder(negative_ids[:, i: i + max_length])[0])

    return torch.cat(concat_embeds, dim=1), torch.cat(neg_embeds, dim=1)


prompt = (22 + random.randint(1, 10)) * "a photo of an astronaut riding a horse on mars"
negative_prompt = (22 + random.randint(1, 10)) * "some negative texts"

print("Our inputs ", prompt, negative_prompt, len(prompt.split(" ")), len(negative_prompt.split(" ")))

prompt_embeds, negative_prompt_embeds = get_pipeline_embeds(pipe, prompt, negative_prompt, "cuda")

image = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_prompt_embeds).images[0]

image.save("done.png")

By the way, I hope someone could also help me with the same token limit problem with CLIP models. Is there any long-context image-text model? Or do I have to fine-tune a long-context CLIP-like model on my own? THanks!

chirag4798 · 2024-06-21T05:01:19Z

@jslegers Since there is no out of the box solution for using long prompts, I've created a PR: huggingface/transformers#31521 to address this, it would be helpful if you re-open this issue. After this is merged I can make necessary changes in huggingface/diffusers

MohamedAliRashad · 2024-06-28T11:33:21Z

Are these solutions workable with SD3 ?

asomoza · 2024-06-28T17:56:25Z

@MohamedAliRashad in this issue we discussed the CLIP token limit for SD3.

I recommend to use this library to do long prompt weighting for SD3.

larinius · 2024-10-09T19:58:49Z

After some struggle with the recommended sd_embed, which didn’t work for me for unknown reasons, I used the Compel prompt as-is, and it worked with SD 1.5 LCM 8 steps.
I also use a simple function that formats human-readable text into Compel format.

# Process the prompt and negative prompt
prompt_conditioning = compel_proc.build_conditioning_tensor(formatted_prompt)
negative_conditioning = compel_proc.build_conditioning_tensor(formatted_negative_prompt if formatted_negative_prompt else "")

# Pad the conditioning tensors to the same length
[prompt_conditioning, negative_conditioning] = compel_proc.pad_conditioning_tensors_to_same_length(
    [prompt_conditioning, negative_conditioning]
)

image = pipe(
                prompt_embeds=prompt_conditioning,
                negative_prompt_embeds=negative_conditioning,
                ...
            )

prompt_sifi = """
adult female with athletic build.
shoulder-length red hair.
wearing white metal combat armor.
wearing a white metal catsuit.
wearing a metal utility belt.
wearing white metal gloves.
medic red cross emblem on the chest.
wearing a professional-looking protective fighter pilot helmet.
wearing a visor.
futuristic background with neon-lit skyscrapers.
hovering vehicles in the distance.
holographic billboards advertising advanced technology.
floating drones patrolling the skies.
glowing energy pathways on the ground.
cybernetic enhancements visible on passersby.
a sleek, futuristic cityscape at dusk.
reflective surfaces casting vibrant, colorful lights.
a sense of technological advancement and urban sophistication.
"""

Looks like even last fragments of prompt not lost. But warning still present:

Token indices sequence length is longer than the specified maximum sequence length for this model (136 > 77). Running this sequence through the model will result in indexing errors

This was referenced Jan 27, 2023

Adding out-of-the-box support for multilingual models like BAAI/AltDiffusion to diffusers #2135

Closed

Switching to DPM++ SDE Karras sampling method #2064

Closed

patrickvonplaten mentioned this issue Jan 31, 2023

Weighted Prompts for Diffusers stable diffusion pipeline #1506

Closed

github-actions bot added the stale Issues that haven't received updates label Feb 26, 2023

github-actions bot closed this as completed Mar 6, 2023

ariffammobox mentioned this issue Apr 19, 2023

Overcoming 77 token limit pythoninoffice/amd_webui#16

Open

BEpresent mentioned this issue Jun 1, 2023

Clarification needed: Token indices sequence length is longer ... during inference damian0815/compel#20

Closed

jgranie mentioned this issue Jun 19, 2023

If the keyword is long "Needed to truncate input" A phrase appears. is this normal? apple/ml-stable-diffusion#108

Open

seruva19 mentioned this issue Jul 13, 2023

Feature Wishlist seruva19/kubin#85

Open

Rmond mentioned this issue Aug 4, 2023

[SD-XL] Passing prompt_embeds/negative_prompt_embeds requires also passing pooled_prompt_embeds/negative_pooled_prompt_embeds #4043

Closed

ghost mentioned this issue Oct 19, 2023

Add support for longer prompts than 77 tokens using compel replicate/cog-sdxl#35

Open

yhdanid mentioned this issue Jan 22, 2024

Dealing with CLIP's 77 token limit ZHO-ZHO-ZHO/ComfyUI-PhotoMaker-ZHO#49

Open

ytzfhqs mentioned this issue Feb 1, 2024

Using the compel library to processing promp natsunoyuki/diffuser_tools#1

Closed

LaurentMazare mentioned this issue Mar 21, 2024

Stable Diffusion fails to produce images with bad "Shape" above 80 or so characters prompt input huggingface/candle#1839

Closed

jdp8 mentioned this issue Apr 25, 2024

Infinite Prompt Length Feature dakenf/diffusers.js#19

Merged

chirag4798 mentioned this issue Jun 20, 2024

Sequence Length Invariant Text Models huggingface/transformers#31521

Closed

5 tasks

jslegers mentioned this issue Jul 31, 2024

Overcoming the 77 token limit in diffusers ZHO-ZHO-ZHO/ComfyUI-InstantID#108

Open

mykeehu mentioned this issue Oct 13, 2024

77 tokens limit ? rupeshs/fastsdcpu#261

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Overcoming the 77 token limit in diffusers #2136

Overcoming the 77 token limit in diffusers #2136

jslegers commented Jan 27, 2023 •

edited

Loading

apolinario commented Jan 27, 2023

jslegers commented Jan 27, 2023

patrickvonplaten commented Jan 31, 2023

jslegers commented Jan 31, 2023

patrickvonplaten commented Jan 31, 2023

jslegers commented Feb 1, 2023

github-actions bot commented Feb 26, 2023

romanfurman6 commented Apr 11, 2023

patrickvonplaten commented Apr 12, 2023

andrevanzuydam commented Apr 19, 2023

andrevanzuydam commented Apr 19, 2023 •

edited

Loading

andrevanzuydam commented Apr 20, 2023

djj0s3 commented Jun 2, 2023

patrickvonplaten commented Jun 5, 2023

djj0s3 commented Jun 6, 2023 •

edited

Loading

DevonPeroutky commented Jun 28, 2023 •

edited

Loading

hckj588ku commented Jul 19, 2023

ManjuVajra commented Jul 25, 2023 •

edited

Loading

patrickvonplaten commented Aug 2, 2023

Atlas3DSS commented Nov 15, 2023

o5faruk commented Nov 28, 2023 •

edited

Loading

lusp75 commented Jan 13, 2024

andrevanzuydam commented Jan 15, 2024

HamenderSingh commented Feb 19, 2024

zhentingqi commented Mar 17, 2024

chirag4798 commented Jun 21, 2024 •

edited

Loading

MohamedAliRashad commented Jun 28, 2024

asomoza commented Jun 28, 2024

larinius commented Oct 9, 2024

Overcoming the 77 token limit in diffusers #2136

Overcoming the 77 token limit in diffusers #2136

Comments

jslegers commented Jan 27, 2023 • edited Loading

Description of the problem

The solution I'd like

Alternatives I've considered

Additional context

apolinario commented Jan 27, 2023

jslegers commented Jan 27, 2023

patrickvonplaten commented Jan 31, 2023

jslegers commented Jan 31, 2023

patrickvonplaten commented Jan 31, 2023

jslegers commented Feb 1, 2023

github-actions bot commented Feb 26, 2023

romanfurman6 commented Apr 11, 2023

patrickvonplaten commented Apr 12, 2023

andrevanzuydam commented Apr 19, 2023

andrevanzuydam commented Apr 19, 2023 • edited Loading

andrevanzuydam commented Apr 20, 2023

djj0s3 commented Jun 2, 2023

patrickvonplaten commented Jun 5, 2023

djj0s3 commented Jun 6, 2023 • edited Loading

DevonPeroutky commented Jun 28, 2023 • edited Loading

hckj588ku commented Jul 19, 2023

ManjuVajra commented Jul 25, 2023 • edited Loading

patrickvonplaten commented Aug 2, 2023

Atlas3DSS commented Nov 15, 2023

o5faruk commented Nov 28, 2023 • edited Loading

lusp75 commented Jan 13, 2024

andrevanzuydam commented Jan 15, 2024

HamenderSingh commented Feb 19, 2024

zhentingqi commented Mar 17, 2024

chirag4798 commented Jun 21, 2024 • edited Loading

MohamedAliRashad commented Jun 28, 2024

asomoza commented Jun 28, 2024

larinius commented Oct 9, 2024

jslegers commented Jan 27, 2023 •

edited

Loading

andrevanzuydam commented Apr 19, 2023 •

edited

Loading

djj0s3 commented Jun 6, 2023 •

edited

Loading

DevonPeroutky commented Jun 28, 2023 •

edited

Loading

ManjuVajra commented Jul 25, 2023 •

edited

Loading

o5faruk commented Nov 28, 2023 •

edited

Loading

chirag4798 commented Jun 21, 2024 •

edited

Loading