This is the official implementation of the paper An image speaks a thousand words but can everyone listen? On image transcreation for cultural relevance by Simran Khanuja, Sathyanarayanan Ramamoorthy, Yueqi Song, and Graham Neubig.
Given the rise of multimedia content, human translators increasingly focus on culturally adapting not only words but also other modalities, such as images, to convey the same meaning. While several applications stand to benefit from this, machine translation systems remain confined to dealing with language in speech and text. In this work, we take a first step towards translating images to make them culturally relevant. First, we build three pipelines comprising state-of-the-art generative models to do the task. Next, we build a two-part evaluation dataset: i) concept: comprising 600 images that are cross-culturally coherent, focusing on a single concept per image; and ii) application: comprising 100 images curated from real-world applications. We conduct a multi-faceted human evaluation of the translated images to assess cultural relevance and meaning preservation. We find that as of today, image-editing models fail at this task, but can be improved by leveraging LLMs and retrievers in the loop. Even the best pipelines can only translate 5% of images for some countries in the easier concept dataset, and no translation is successful for some countries in the application dataset, highlighting the challenging nature of the task.
The test data curated in this paper can be found under the data folder. We have also uploaded this data to Zeno for visualization purposes:
- Concept Dataset
- Application Dataset
Create the conda environment using environment.yml. The Python version used is 3.10.12.
conda env create -f environment.yml
conda activate transcreation
This pipeline uses the InstructPix2Pix model to make an image culturally relevant to a target country. It is an end-to-end pipeline that relies on a single image-editing model. To run the pipeline for a specific list of countries or all the countries, use the following commands:
bash ./scripts/part1/e2e-instruct.sh india,japan
bash ./scripts/part1/e2e-instruct.sh all
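For intuition, here is a minimal sketch of what this end-to-end edit looks like with the diffusers implementation of InstructPix2Pix; the checkpoint, instruction wording, and guidance settings below are illustrative assumptions rather than the repo's exact configuration:

```python
# A minimal sketch of the end-to-end edit using the diffusers InstructPix2Pix
# pipeline. This is not the repo's script: the checkpoint, instruction wording,
# and guidance settings below are illustrative assumptions.
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("example.png").convert("RGB")
instruction = "make this image culturally relevant to Japan"  # assumed prompt wording

edited = pipe(
    instruction,
    image=image,
    num_inference_steps=50,
    image_guidance_scale=1.5,  # how closely to stay to the input image
    guidance_scale=7.5,        # how strongly to follow the instruction
).images[0]
edited.save("example_japan.png")
```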
In this pipeline, we first caption the images, edit the captions to make them culturally relevant, and use these captions to edit the original image. The steps to run this pipeline are given below:
First, enter your OPENAI_API_KEY in ./configs/part1/caption-llm_edit/make_configs.sh.
Next, we caption the images using InstructBLIP and edit the captions for cultural relevance using GPT-3.5. For this, we first need to make config files for each country. To do this, run the following command:
bash ./configs/part1/caption-llm_edit/make_configs.sh
To run this step for a specific list of countries or all the countries, use the following commands:
bash ./scripts/part1/caption-llm_edit.sh portugal,turkey
bash ./scripts/part1/caption-llm_edit.sh all
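For reference, a rough sketch of these two steps (captioning with InstructBLIP, then asking GPT-3.5 to rewrite the caption for a target country) is shown below; the checkpoint and prompt wording are assumptions, not necessarily what the config files use:

```python
# A rough sketch of these two steps: caption with InstructBLIP, then ask
# GPT-3.5 to rewrite the caption for a target country. The checkpoint and
# prompt wording are assumptions, not necessarily what the config files use.
import torch
from PIL import Image
from openai import OpenAI
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

model_id = "Salesforce/instructblip-vicuna-7b"  # assumed checkpoint
processor = InstructBlipProcessor.from_pretrained(model_id)
model = InstructBlipForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

image = Image.open("example.png").convert("RGB")
inputs = processor(
    images=image, text="Describe this image in detail.", return_tensors="pt"
).to("cuda", torch.float16)
caption = processor.batch_decode(
    model.generate(**inputs, max_new_tokens=64), skip_special_tokens=True
)[0].strip()

client = OpenAI()  # reads OPENAI_API_KEY from the environment
edited_caption = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": f"Rewrite this caption so the scene is culturally relevant to Japan: {caption}",
    }],
).choices[0].message.content
print(edited_caption)
```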
We've made modifications on top of https://github.com/MichalGeyer/pnp-diffusers. Kindly clone our fork of this repository from https://github.com/simran-khanuja/pnp-diffusers under ./src/pipelines/cap-edit. Follow their README and first create the pnp-diffusers environment. Image editing using the plug-and-play model involves two stages: a) obtaining the noisy latents of the original image; and b) editing the image as per the text guidance. To obtain latents for a specific list of countries or all the countries, use the following commands:
bash ./scripts/part1/step1_pnp_preprocess.sh brazil,nigeria
bash ./scripts/part1/step1_pnp_preprocess.sh all
To edit images using the LLM edits as text guidance, for a specific list of countries or all the countries, use the following commands:
bash ./scripts/part1/step2_pnp_img-edit.sh brazil,nigeria
bash ./scripts/part1/step2_pnp_img-edit.sh all
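If you prefer to drive both stages from Python, a small wrapper like the one below works; it simply chains the repo's own scripts and assumes the pnp-diffusers fork is cloned under ./src/pipelines/cap-edit and its conda environment is active:

```python
# A small driver sketch that chains the two plug-and-play stages for a list of
# countries by calling the repo's own scripts. It assumes the pnp-diffusers fork
# is cloned under ./src/pipelines/cap-edit and its conda environment is active.
import subprocess

countries = "brazil,nigeria"  # or "all"
for script in (
    "./scripts/part1/step1_pnp_preprocess.sh",  # stage a: obtain noisy latents
    "./scripts/part1/step2_pnp_img-edit.sh",    # stage b: text-guided image editing
):
    subprocess.run(["bash", script, countries], check=True)
```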
This step is the same as for cap-edit. There is no need to run anything here if step 1 of cap-edit has already been run. Otherwise, follow the instructions above to get the captions and LLM edits.
Here, we will use the LLM edits to retrieve relevant images from LAION. Since LAION has been redacted, this code has been modified to use Datacomp-1b instead. To do this, first create a fresh environment by running:
conda create -n clip-ret-env python=3.10
conda activate clip-ret-env
We leverage the clip-retrieval infrastructure to obtain indices in a scalable and efficient way. After activating the environment, run:
pip install clip-retrieval
Now, we will create country-specific subsets of Datacomp-1b. Navigate to ./src/pipelines/cap-retrieve/prepare_datacomp and run categorize_cctld.py to create JSON files of image paths for each country. This script iterates over Datacomp and creates country-specific JSON files based on the ccTLD of the image URL. The list of ccTLDs can be found in ./src/pipelines/cap-retrieve/prepare_datacomp/cctld.tsv.
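For context, the bucketing done by categorize_cctld.py roughly amounts to the sketch below; the metadata layout, column names, and output format here are assumptions, so treat it as an illustration rather than the script itself:

```python
# A hedged sketch of bucketing Datacomp image URLs by ccTLD. The metadata
# location, column names, and output format are assumptions, not the script's
# actual layout.
import csv
import json
from pathlib import Path
from urllib.parse import urlparse

import pandas as pd

# Map ccTLDs (e.g. ".br") to country names, as listed in cctld.tsv
cctld_to_country = {}
with open("cctld.tsv") as f:
    for row in csv.reader(f, delimiter="\t"):
        cctld_to_country[row[0].strip().lower()] = row[1].strip().lower()

# Bucket image URLs by the ccTLD of their hostname
buckets = {country: [] for country in cctld_to_country.values()}
for shard in Path("datacomp_metadata").glob("*.parquet"):  # assumed metadata location
    for url in pd.read_parquet(shard, columns=["url"])["url"]:  # assumed column name
        host = urlparse(url).hostname or ""
        tld = "." + host.rsplit(".", 1)[-1]
        if tld in cctld_to_country:
            buckets[cctld_to_country[tld]].append(url)

# Write one JSON file of matching image URLs per country
for country, urls in buckets.items():
    Path(f"{country}.json").write_text(json.dumps(urls))
```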
Next, run the following commands to create datasets from the images, get embeddings for images and text, and create indices from the embeddings, respectively. We will query these indices to retrieve relevant images, given the LLM-edited caption as a text query:
bash ./src/pipelines/cap-retrieve/prepare_datacomp/step1-img2dataset.sh
bash ./src/pipelines/cap-retrieve/prepare_datacomp/step2-embeddings.sh
bash ./src/pipelines/cap-retrieve/prepare_datacomp/step3-index.sh
Now, to retrieve images from Datacomp given a text query (here, the LLM-edited captions obtained in Step 1), run the following:
bash ./scripts/part1/cap-retrieve.sh
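As a sanity check, you can also query an index directly with clip-retrieval's ClipClient; the backend URL and index name below are placeholders for wherever you serve the indices built above (a clip-retrieval backend must be running):

```python
# A hedged sketch of querying a country-specific index with clip-retrieval's
# ClipClient. The backend URL and index name are placeholders: a clip-retrieval
# backend serving the indices built above must be running.
from clip_retrieval.clip_client import ClipClient, Modality

client = ClipClient(
    url="http://localhost:1234/knn-service",  # assumed clip-retrieval backend
    indice_name="brazil",                     # assumed per-country index name
    modality=Modality.IMAGE,
    num_images=10,
)
# Use the LLM-edited caption as the text query (this example query is illustrative)
results = client.query(text="a plate of feijoada on a wooden table")
for r in results:
    print(r["url"], r["similarity"])
```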
[Brazil-Application]
[India-Application]
[Japan-Application]
[Nigeria-Application]
[Portugal-Application]
[Turkey-Application]
[United-States-Application]
Here, we caption the images using GPT-4o and ask GPT-4 to edit the captions for cultural relevance. Finally, we prompt DALL-E 3 to generate new images from the GPT-4-edited captions.
Here, we caption the images using GPT-4o and ask GPT-4 to edit the captions for cultural relevance in a language primarily spoken in the target country. Finally, we prompt DALL-E 3 to generate new images from the GPT-4-edited captions.
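A minimal sketch of this caption, edit, and generate chain with the OpenAI Python SDK is shown below; the prompt wording, image handling, and target country are assumptions about the recipe described above, not the repo's exact prompts:

```python
# A minimal sketch of the caption -> edit -> generate chain with the OpenAI
# Python SDK. Prompt wording, image handling, and the target country are
# assumptions about the recipe described above, not the repo's exact prompts.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("example.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

# Step 1: caption the image with GPT-4o
caption = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Describe this image in one detailed sentence."},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
    ]}],
).choices[0].message.content

# Step 2: edit the caption for cultural relevance with GPT-4
edited = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": f"Rewrite this caption so the scene is culturally relevant to Japan: {caption}",
    }],
).choices[0].message.content

# Step 3: generate a new image from the edited caption with DALL-E 3
image_url = client.images.generate(
    model="dall-e-3", prompt=edited, size="1024x1024", n=1
).data[0].url
print(image_url)
```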
If you'd like to visualize model outputs for each part, please refer to the Zeno links below. Note that the outputs were randomized for human evaluation; can you guess which pipeline each generated image is from? 😉
If you find this work useful in your research, please cite:
@article{khanuja2024image,
title={An image speaks a thousand words, but can everyone listen? On translating images for cultural relevance},
author={Khanuja, Simran and Ramamoorthy, Sathyanarayanan and Song, Yueqi and Neubig, Graham},
journal={arXiv preprint arXiv:2404.01247},
year={2024}
}