Everything CLIP related seems to break starting from transformers 4.28.0 #24857

Closed
4 tasks done
andreaferretti opened this issue Jul 17, 2023 · 8 comments

Comments

@andreaferretti

System Info

  • transformers version: 4.28.0
  • Platform: Linux-5.10.107+-x86_64-with-glibc2.31
  • Python version: 3.9.16
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 1.11.0+cu113 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help?

@amyeroberts

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

It seems to me that there is some regression starting from transformers 4.28.0 that affects the CLIP vision model and everything related to it.

In particular, I am having issues with

  • ClipSeg
  • the CLIPVisionModel proper.

ClipSeg

For ClipSeg, I am able to use it and get the expected masks by essentially following the example here:

from transformers import AutoProcessor, CLIPSegForImageSegmentation
from PIL import Image
import requests

processor = AutoProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["a cat", "a remote", "a blanket"]
inputs = processor(text=texts, images=[image] * len(texts), padding=True, return_tensors="pt")

outputs = model(**inputs)

logits = outputs.logits
print(logits.shape)

Then logits contains the logits from which I can obtain a mask with something like:

import torch

mask = torch.exp(logits)
mask /= mask.max()

I tested this and it works reliably up to transformers 4.27.4, but with transformers 4.28.0 I get masks that are completely black regardless of the input image.

CLIPVisionModel

This is harder to describe, since it relies on an internal model. I have trained a model that makes use of the image embeddings generated by CLIPVisionModel for custom subject generation. Everything works well up to transformers 4.27.4. If I switch to 4.28.0, the generated image changes completely; the only change is installing 4.28.0.

In fact, if I save the embeddings generated by CLIPVisionModel with the two different versions for any random image, I see that they are different. To be sure, this is how I generate image embeddings:

clip = CLIPModel.from_pretrained(...)
preprocessor = CLIPProcessor.from_pretrained(...)
...
encoded_data = preprocessor(
    text=prompts,
    images=images,
    return_tensors="pt",
    max_length=77,
    padding="max_length",
    truncation=True,
)
clip_output = clip(
    input_ids=encoded_data.input_ids,
    pixel_values=encoded_data.pixel_values,
)
image_embeds = clip.visual_projection(
    clip_output.vision_model_output.last_hidden_state
)

For reference, I am using clip-vit-large-patch14.

Expected behavior

I would expect CLIPVisionModel to give the same result on the same image, both in 4.27.4 and in 4.28.0.

@andreaferretti
Author

andreaferretti commented Jul 17, 2023

To make this reproducible, here is a demonstration that the output of CLIPProcessor changes. I run the following script:

from PIL import Image
import requests

import transformers
from torchvision.transforms.functional import to_tensor
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
reference = to_tensor(image)

encoded_data = processor(
    text=[""],
    images=[reference],
    return_tensors="pt",
    max_length=77,
    padding="max_length",
    truncation=True,
)

print(transformers.__version__)
print(encoded_data.pixel_values.mean())

With 4.27.4 I get

4.27.4
tensor(0.2463)

With 4.28.0 I get

4.28.0
tensor(-1.6673)

@andreaferretti
Author

I figured out the issue: starting from transformers 4.28.0, the CLIPProcessor expects tensors in the range [0, 255]. This seems like a pretty breaking change to me! If I multiply my tensor by 255, I get the right results.
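
For instance, a minimal sketch of that workaround, reusing processor and reference from the reproduction script above (the printed value is only approximately what 4.27.4 gave):

# Scale the 0-1 tensor produced by to_tensor back to 0-255 before handing
# it to the processor, so that the processor's own 1/255 rescaling yields
# the intended input again.
encoded_data = processor(
    text=[""],
    images=[reference * 255],
    return_tensors="pt",
    max_length=77,
    padding="max_length",
    truncation=True,
)
print(encoded_data.pixel_values.mean())  # roughly back to the 4.27.4 value (~0.2463)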

@NielsRogge
Contributor

Hi,

Thanks for reporting. This seems related to #23096 and may be caused by #22458. cc @amyeroberts

@amyeroberts
Collaborator

Hi @andreaferretti, thanks for raising this issue!

What's being observed is actually a resolution of inconsistent behaviour in the previous CLIP feature extractors. I'll explain:

  • to_tensor() doesn't just convert to a pytorch tensor, it also rescales the values to be between 0 - 1
  • The deprecated feature extractors and image processors use Pillow for resizing their images.
  • Pillow requires that for RGB, pixel values are uint8 between 0-255.
  • Therefore input images with float values are upscaled and cast to uint8 before being converted to a PIL.Image.Image

In the previous behaviour, images kept their upscaled values after resizing. Currently, if an image was upscaled during resizing, the pixel values are downscaled back afterwards, e.g. to between 0-1. This ensures that the user can set do_resize to True or False and the only difference in the output image is its size (and interpolated pixels). Previously, if you set do_resize=False, your image pixel values were never upscaled; they remained between 0-1 and would be downscaled again by the rescaling step, which is what now happens regardless of do_resize.
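
As a minimal numeric sketch of the effect (my own illustration, assuming the standard OpenAI CLIP normalization constants), this is also roughly why the reproduction script above prints about -1.67 for a 0-1 input:

import numpy as np

# A 0-1 input is rescaled by 1/255 and then normalized with the OpenAI CLIP
# mean/std, so every channel collapses to roughly -mean/std.
clip_mean = np.array([0.48145466, 0.4578275, 0.40821073])
clip_std = np.array([0.26862954, 0.26130258, 0.27577711])

x = 0.25  # a typical pixel value of an image already scaled to 0-1
print(((x / 255) - clip_mean) / clip_std)  # ~ [-1.79, -1.75, -1.48], mean ~ -1.67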

Rather than try to infer processor behaviour based on the inputs, we keep the processing behaviour consistent and let the user explicitly control this. If you wish to pass in images whose pixel values have already been rescaled to 0-1, you just need to tell the image processor not to do any additional rescaling using the do_rescale flag:

outputs = image_processor(images, do_rescale=False)

Alternatively, you could pass in the images without calling to_tensor.
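
Applied to the reproduction script above, a minimal sketch would be something like the following (calling the underlying image processor directly so it is clear where the flag goes; the text inputs are handled as before):

# Keep the 0-1 tensor from to_tensor and disable the extra rescaling.
pixel_values = processor.image_processor(
    [reference], do_rescale=False, return_tensors="pt"
).pixel_values
print(pixel_values.mean())  # should be close to the 4.27.4 value again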

In the issues linked by @NielsRogge, this is also explained: #23096 (comment)

However, this is the second time a similar issue has been raised, indicating that the behaviour is unexpected. I'll think about how to best address this with documentation or a possible warning within the code.

@andreaferretti
Author

Yeah, it would be useful to add a warning mentioning do_rescale, as well as to mention this issue in the documentation of CLIP and related models.

@sayakpaul
Member

sayakpaul commented Jan 31, 2024

I am still getting widely different results between the JAX implementation of CLIP in scenic and the one we have in transformers (PyTorch).

from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection
import torch
from PIL import Image 

import jax
import numpy as np
from scenic.projects.baselines.clip import model as clip


def _clip_preprocess(images, size):
  target_shape = images.shape[:-3] + (size, size, images.shape[-1])
  images = jax.image.resize(images, shape=target_shape, method='bicubic')
  images = clip.normalize_image(images)

  return images

def get_image_in_format(image, size, format="pt"):
    images = np.array(image) / 255.
    images = np.expand_dims(images, 0)
    pp_images = _clip_preprocess(images, size)

    if format == "pt":
        inputs = {}
        inputs["pixel_values"] = torch.from_numpy(np.array(pp_images))
        inputs["pixel_values"] = inputs["pixel_values"].permute(0, 3, 1, 2)
        return inputs 

    inputs = pp_images
    return inputs

# Comes from https://huggingface.co/datasets/diffusers/docs-images/blob/main/amused/glowing_512_2.png
image = Image.open("glowing_512_2.png")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14-336").eval()

inputs = get_image_in_format(image, processor.crop_size["height"], format="pt")
with torch.no_grad():
    output = model(**inputs)

temp = output.image_embeds[0, :4].numpy().flatten().tolist()
print(", ".join([str(f"{x:.4f}") for x in temp]))
print("=====Printing JAX model=====")


_CLIP_MODEL_NAME = 'vit_l14_336px'
_model = clip.MODELS[_CLIP_MODEL_NAME]()
_model_vars = clip.load_model_vars(_CLIP_MODEL_NAME)
input_image_size = clip.IMAGE_RESOLUTION[_CLIP_MODEL_NAME]

images = get_image_in_format(image, size=input_image_size, format="jax")
image_embs, _ = _model.apply(_model_vars, images, None)
temp = np.asarray(image_embs[0, :4]).flatten().tolist()
print(", ".join([str(f"{x:.4f}") for x in temp]))

Gives:

-0.0898, 0.1304, 0.2402, -0.0378
=====Printing JAX model=====
-0.0046, 0.0068, 0.0124, -0.0020

These seem quite different for the exact same input.

@sanchit-gandhi would you have a clue about it?

@NielsRogge
Contributor

Hi,

Not sure if you're comparing apples to apples. When comparing the original CLIP repository to the Transformers one, they match: https://colab.research.google.com/drive/15ZhC32ovBKAU5JqC-kcIOntW_oU-JrkB?usp=sharing.

Scenic is not the original implementation of CLIP so there might be some differences. I would first check whether the Scenic implementation outputs the same logits as the OpenAI CLIP repository.

@sayakpaul
Member

You are right:

import clip
import torch 
import jax
import numpy as np
from scenic.projects.baselines.clip import model as clip_scenic

inputs = np.random.randn(1, 336, 336, 3).astype(np.float32)  # float32 to match the CPU model weights
model, preprocess = clip.load("ViT-L/14@336px", device="cpu")

with torch.no_grad():
    image = torch.from_numpy(inputs.transpose(0, 3, 1, 2))
    image_features = model.encode_image(image).numpy()
    print(image_features.shape)

temp = image_features[0, :4].flatten().tolist()
print(", ".join([str(f"{x:.4f}") for x in temp]))
print("=====Printing JAX model=====")

_CLIP_MODEL_NAME = 'vit_l14_336px'
_model = clip_scenic.MODELS[_CLIP_MODEL_NAME]()
_model_vars = clip_scenic.load_model_vars(_CLIP_MODEL_NAME)

images = jax.numpy.array(inputs)
image_embs, _ = _model.apply(_model_vars, images, None)
print(image_embs.shape)
temp = np.asarray(image_embs[0, :4]).flatten().tolist()
print(", ".join([str(f"{x:.4f}") for x in temp]))

Gives:

(1, 768)
-0.1827, 0.7319, 0.8779, 0.4829
=====Printing JAX model=====
(1, 768)
-0.0107, 0.0429, 0.0514, 0.0283

Sorry for the false alarm here. Have raised an issue: google-research/scenic#991.
