Everything CLIP related seems to break starting from transformers 4.28.0 #24857

Closed
4 tasks done
andreaferretti opened this issue Jul 17, 2023 · 8 comments

Comments

@andreaferretti

System Info

  • transformers version: 4.28.0
  • Platform: Linux-5.10.107+-x86_64-with-glibc2.31
  • Python version: 3.9.16
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 1.11.0+cu113 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help?

@amyeroberts

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

It seems to me that there is some regression starting from transformers 4.28.0 that affects the CLIP vision model and everything related to it.

In particular, I am having issues with

  • ClipSeg
  • the CLIPVisionModel proper.

ClipSeg

For ClipSeg, I am able to use it and get the expected masks by essentially following the example here:

from transformers import AutoProcessor, CLIPSegForImageSegmentation
from PIL import Image
import requests

processor = AutoProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["a cat", "a remote", "a blanket"]
inputs = processor(text=texts, images=[image] * len(texts), padding=True, return_tensors="pt")

outputs = model(**inputs)

logits = outputs.logits
print(logits.shape)

Then logits contains the logits from which I can obtain a mask with something like:

import torch

mask = torch.exp(logits)
mask /= mask.max()

I tested this and it works reliably up to transformers 4.27.4, but with transformers 4.28.0 I get masks that are completely black regardless of the input image.

CLIPVisionModel

This is harder to describe, since it relies on an internal model. I have trained a model that makes use of the image embeddings generated by CLIPVisionModel for custom subject generation. Everything works well up to transformers 4.27.4. If I switch to 4.28.0, the generated image changes completely; the only change is installing 4.28.0.

In fact, if I save the embeddings generated by CLIPVisionModel with the two different versions for any random image, I see that they are different. To be sure, this is how I generate image embeddings:

clip = CLIPModel.from_pretrained(...)
preprocessor = CLIPProcessor.from_pretrained(...)
...
encoded_data = preprocessor(
    text=prompts,
    images=images,
    return_tensors="pt",
    max_length=77,
    padding="max_length",
    truncation=True,
)
clip_output = clip(
    input_ids=encoded_data.input_ids,
    pixel_values=encoded_data.pixel_values,
)
image_embeds = clip.visual_projection(
    clip_output.vision_model_output.last_hidden_state
)

For reference, I am using clip-vit-large-patch14.

Expected behavior

I would expect CLIPVisionModel to give the same result on the same image, both in 4.27.4 and in 4.28.0.

@andreaferretti
Author

andreaferretti commented Jul 17, 2023

To make this reproducible, here is a demonstration that the output of CLIPProcessor changes. I run the following script:

from PIL import Image
import requests

import transformers
from torchvision.transforms.functional import to_tensor
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
reference = to_tensor(image)

encoded_data = processor(
    text=[""],
    images=[reference],
    return_tensors="pt",
    max_length=77,
    padding="max_length",
    truncation=True,
)

print(transformers.__version__)
print(encoded_data.pixel_values.mean())

With 4.27.4 I get

4.27.4
tensor(0.2463)

With 4.28.0 I get

4.28.0
tensor(-1.6673)

@andreaferretti
Author

I figured out the issue: starting from transformers 4.28.0, the CLIPProcessor expects tensors in the range [0, 255]. This seems like a pretty breaking change to me! If I multiply my tensor by 255, I get the right results.
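
For instance, a minimal sketch of that workaround, reusing processor and reference from the reproduction script above (the printed value is only approximately what 4.27.4 gave):

# Scale the 0-1 tensor produced by to_tensor back to 0-255 before handing
# it to the processor, so that the processor's own 1/255 rescaling yields
# the intended input again.
encoded_data = processor(
    text=[""],
    images=[reference * 255],
    return_tensors="pt",
    max_length=77,
    padding="max_length",
    truncation=True,
)
print(encoded_data.pixel_values.mean())  # roughly back to the 4.27.4 value (~0.2463)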

@NielsRogge
Contributor

Hi,

Thanks for reporting. This seems related to #23096 and may be caused by #22458. cc @amyeroberts

@amyeroberts
Collaborator

Hi @andreaferretti, thanks for raising this issue!

What's being observed is actually a resolution of inconsistent behaviour in the previous CLIP feature extractors. I'll explain:

  • to_tensor() doesn't just convert to a pytorch tensor, it also rescales the values to be between 0 - 1
  • The deprecated feature extractors and image processors use Pillow for resizing their images.
  • Pillow requires that for RGB, pixel values are uint8 between 0-255.
  • Therefore input images with float values are upscaled and cast to uint8 before being converted to a PIL.Image.Image

In the previous behaviour, images kept their upscaled values after resizing. Currently, if an image was upscaled during resizing, the pixel values are downscaled back afterwards, e.g. to between 0-1. This ensures that the user can set do_resize to True or False and the only difference in the output image is its size (and interpolated pixels). Previously, if you set do_resize=False, your image pixel values were never upscaled; they remained between 0-1 and would be downscaled again by the rescaling step, which is what now happens regardless of do_resize.
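
As a minimal numeric sketch of the effect (my own illustration, assuming the standard OpenAI CLIP normalization constants), this is also roughly why the reproduction script above prints about -1.67 for a 0-1 input:

import numpy as np

# A 0-1 input is rescaled by 1/255 and then normalized with the OpenAI CLIP
# mean/std, so every channel collapses to roughly -mean/std.
clip_mean = np.array([0.48145466, 0.4578275, 0.40821073])
clip_std = np.array([0.26862954, 0.26130258, 0.27577711])

x = 0.25  # a typical pixel value of an image already scaled to 0-1
print(((x / 255) - clip_mean) / clip_std)  # ~ [-1.79, -1.75, -1.48], mean ~ -1.67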

Rather than try to infer processor behaviour based on the inputs, we keep the processing behaviour consistent and let the user explicitly control this. If you wish to pass in images whose pixel values have already been rescaled to 0-1, you just need to tell the image processor not to do any additional rescaling using the do_rescale flag:

outputs = image_processor(images, do_rescale=False)

Alternatively, you could pass in the images without calling to_tensor.
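
Applied to the reproduction script above, a minimal sketch would be something like the following (calling the underlying image processor directly so it is clear where the flag goes; the text inputs are handled as before):

# Keep the 0-1 tensor from to_tensor and disable the extra rescaling.
pixel_values = processor.image_processor(
    [reference], do_rescale=False, return_tensors="pt"
).pixel_values
print(pixel_values.mean())  # should be close to the 4.27.4 value again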

In the issues linked by @NielsRogge, this is also explained: #23096 (comment)

However, this is the second time a similar issue has been raised, indicating that the behaviour is unexpected. I'll think about how to best address this with documentation or a possible warning within the code.

@andreaferretti
Author

Yeah, it would be useful to add a warning mentioning do_rescale, as well as to mention this issue in the documentation of CLIP and related models.

@sayakpaul
Member

sayakpaul commented Jan 31, 2024

I am still getting widely different results between the JAX implementation of CLIP in scenic and the one we have in transformers (PyTorch).

from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection
import torch
from PIL import Image 

import jax
import numpy as np
from scenic.projects.baselines.clip import model as clip


def _clip_preprocess(images, size):
  target_shape = images.shape[:-3] + (size, size, images.shape[-1])
  images = jax.image.resize(images, shape=target_shape, method='bicubic')
  images = clip.normalize_image(images)

  return images

def get_image_in_format(image, size, format="pt"):
    images = np.array(image) / 255.
    images = np.expand_dims(images, 0)
    pp_images = _clip_preprocess(images, size)

    if format == "pt":
        inputs = {}
        inputs["pixel_values"] = torch.from_numpy(np.array(pp_images))
        inputs["pixel_values"] = inputs["pixel_values"].permute(0, 3, 1, 2)
        return inputs 

    inputs = pp_images
    return inputs

# Comes from https://huggingface.co/datasets/diffusers/docs-images/blob/main/amused/glowing_512_2.png
image = Image.open("glowing_512_2.png")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14-336").eval()

inputs = get_image_in_format(image, processor.crop_size["height"], format="pt")
with torch.no_grad():
    output = model(**inputs)

temp = output.image_embeds[0, :4].numpy().flatten().tolist()
print(", ".join([str(f"{x:.4f}") for x in temp]))
print("=====Printing JAX model=====")


_CLIP_MODEL_NAME = 'vit_l14_336px'
_model = clip.MODELS[_CLIP_MODEL_NAME]()
_model_vars = clip.load_model_vars(_CLIP_MODEL_NAME)
input_image_size = clip.IMAGE_RESOLUTION[_CLIP_MODEL_NAME]

images = get_image_in_format(image, size=input_image_size, format="jax")
image_embs, _ = _model.apply(_model_vars, images, None)
temp = np.asarray(image_embs[0, :4]).flatten().tolist()
print(", ".join([str(f"{x:.4f}") for x in temp]))

Gives:

-0.0898, 0.1304, 0.2402, -0.0378
=====Printing JAX model=====
-0.0046, 0.0068, 0.0124, -0.0020

These seem quite different for the exact same input.

@sanchit-gandhi would you have a clue about it?

@NielsRogge
Contributor

Hi,

Not sure if you're comparing apples to apples. When comparing the original CLIP repository to the Transformers one, they match: https://colab.research.google.com/drive/15ZhC32ovBKAU5JqC-kcIOntW_oU-JrkB?usp=sharing.

Scenic is not the original implementation of CLIP so there might be some differences. I would first check whether the Scenic implementation outputs the same logits as the OpenAI CLIP repository.

@sayakpaul
Member

You are right:

import clip
import torch 
import jax
import numpy as np
from scenic.projects.baselines.clip import model as clip_scenic

inputs = np.random.randn(1, 336, 336, 3).astype(np.float32)  # float32 to match the CPU model weights
model, preprocess = clip.load("ViT-L/14@336px", device="cpu")

with torch.no_grad():
    image = torch.from_numpy(inputs.transpose(0, 3, 1, 2))
    image_features = model.encode_image(image).numpy()
    print(image_features.shape)

temp = image_features[0, :4].flatten().tolist()
print(", ".join([str(f"{x:.4f}") for x in temp]))
print("=====Printing JAX model=====")

_CLIP_MODEL_NAME = 'vit_l14_336px'
_model = clip_scenic.MODELS[_CLIP_MODEL_NAME]()
_model_vars = clip_scenic.load_model_vars(_CLIP_MODEL_NAME)

images = jax.numpy.array(inputs)
image_embs, _ = _model.apply(_model_vars, images, None)
print(image_embs.shape)
temp = np.asarray(image_embs[0, :4]).flatten().tolist()
print(", ".join([str(f"{x:.4f}") for x in temp]))

Gives:

(1, 768)
-0.1827, 0.7319, 0.8779, 0.4829
=====Printing JAX model=====
(1, 768)
-0.0107, 0.0429, 0.0514, 0.0283

Sorry for the false alarm here. Have raised an issue: google-research/scenic#991.
