Replies: 2 comments 6 replies
-
Quoting myself from #147 for clarity:
As for your question:
This does not make much sense as stated. What you need is some data to finetune on. With (most) LLMs, instead of having an encoder encode the source data and a decoder decode the target data, you only have a decoder, which "reads" the prompt and continues generation. So you need to show your model how it is supposed to perform such generation, hence a prompt that basically says [Task][SRC][TGT]. Then you finetune the LLM like any other LLM (see llama2 finetuning for an example; this can easily be extended to other models).
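As a concrete illustration of the [Task][SRC][TGT] idea, here is a minimal sketch of how a bilingual pair could be turned into a single decoder-only training example. The function name and prompt wording are assumptions for illustration, not the actual promptize_llama2.py implementation:

```python
# Hypothetical sketch: format a bilingual sentence pair as one
# decoder-only training string of the form [Task][SRC][TGT].
# The LLM is then finetuned to continue the prompt with the target.
def promptize(src: str, tgt: str,
              src_lang: str = "English", tgt_lang: str = "French") -> str:
    task = f"Translate the following {src_lang} sentence into {tgt_lang}.\n"
    # [Task] + [SRC] form the prompt; [TGT] is what the model learns to generate.
    return f"{task}{src_lang}: {src}\n{tgt_lang}: {tgt}"

example = promptize("The cat sleeps.", "Le chat dort.")
print(example)
```

At inference time you would feed the model everything up to and including `French: ` and let it generate the rest.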
-
Read this: https://huggingface.co/Unbabel/TowerInstruct-Mistral-7B-v0.2/discussions/3
-
Hi, I'm keen to test fine-tuning Llama 2 or 3.1 with my bilingual datasets. In the recipe provided for wmt22_with_TowerInstruct-llama2, I don't see a yaml config file to train from the Llama model.
Can you explain briefly how I can go about fine-tuning an LLM with my bilingual datasets?
Also, does the promptize_llama2.py script act as a prompt to start the fine-tuning training?