Add OLMo: 1B & 7B #927

Closed
wants to merge 21 commits into from
Conversation

rasbt
Collaborator

@rasbt rasbt commented Feb 12, 2024

Adds the popular and fully open-source OLMo models by Allen AI.

  • Implement model download
  • Test tokenizer
  • Implement HF checkpoint conversion
  • Clean up HF checkpoint conversion
  • Make sure to use the right layer normalization
  • Make sure generate.py produces reasonable outputs
  • Update download and finetuning docs
  • Test pretraining
  • Test finetuning
    • Full finetuning
    • LoRA
    • Adapter
  • Add tests
  • Update README

Fixes #925

@rasbt
Collaborator Author

rasbt commented Feb 13, 2024

I'm a bit stuck with the conversion and would appreciate your advice and ideas, @carmocca or @Andrei-Aksionov!

So, here are three special things about OLMo:

  1. They use weight tying like GPT-2: the WTE weight is reused as the output projection weight. The way they saved the tensors on the Hub, though, they simply duplicated that tensor, so no action should be required there. When loading the model in HuggingFace, I checked that olmo.model.transformer.wte.weight and olmo.model.transformer.ff_out.weight contain the same tensor. That should be all good here.

  2. They use a non-parametric LayerNorm, i.e., their LayerNorm doesn't have the scale (weight) and shift (bias) parameters. To avoid code changes just for this model, my workaround is to use ones and zeros so that these parameters have no effect:

        state_dict[f"transformer.h.{l}.norm_1.weight"] = torch.ones(config.n_embd)
        state_dict[f"transformer.h.{l}.norm_2.weight"] = torch.ones(config.n_embd)
        state_dict[f"transformer.h.{l}.norm_1.bias"] = torch.zeros(config.n_embd)
        state_dict[f"transformer.h.{l}.norm_2.bias"] = torch.zeros(config.n_embd)
  3. The problem is that I'm missing weights ...

The HF version looks like this, which is confusing because, as far as I can tell, the ops are not applied in that "sequential" order:

OLMoForCausalLM(
  (model): Olmo(
    (transformer): ModuleDict(
      (wte): Embedding(50304, 2048)
      (emb_drop): Dropout(p=0.0, inplace=False)
      (ln_f): LayerNorm()
      (blocks): ModuleList(
        (0-15): 16 x OlmoSequentialBlock(
          (dropout): Dropout(p=0.0, inplace=False)
          (act): SwiGLU()
          (attn_out): Linear(in_features=2048, out_features=2048, bias=False)
          (ff_out): Linear(in_features=8192, out_features=2048, bias=False)
          (rotary_emb): RotaryEmbedding()
          (attn_norm): LayerNorm()
          (ff_norm): LayerNorm()
          (att_proj): Linear(in_features=2048, out_features=6144, bias=False)
          (ff_proj): Linear(in_features=2048, out_features=16384, bias=False)
        )
      )
      (ff_out): Embedding(50304, 2048)
    )
  )
)

Unless I'm wrong, I think what happens is that ff_proj is a placeholder for the MLP's FC1 and FC2 layers, i.e., the first half is FC1 and the second half is FC2. It's kind of confusing though.

What I'm thinking is that we have to split the fc weights, which would avoid having to write custom code in the GPT model class:

    weight_map = {
        "model.transformer.wte.weight": "transformer.wte.weight",
        "model.transformer.ff_out.weight": "lm_head.weight",
        "model.transformer.blocks.{}.attn_out.weight": "transformer.h.{}.attn.proj.weight",
        "model.transformer.blocks.{}.ff_proj.weight": "transformer.h.{}.mlp.fc_1.weight", # split into fc1 and fc2
        "model.transformer.blocks.{}.att_proj.weight": "transformer.h.{}.attn.attn.weight",
        "model.transformer.blocks.{}.ff_out.weight": "transformer.h.{}.mlp.proj.weight",
    }
...

    for l in range(config.n_layer):
        # ff_proj stacks both MLP projections along dim 0, so split at the intermediate size
        state_dict[f"transformer.h.{l}.mlp.fc_2.weight"] = state_dict[f"transformer.h.{l}.mlp.fc_1.weight"][config.intermediate_size:]
        state_dict[f"transformer.h.{l}.mlp.fc_1.weight"] = state_dict[f"transformer.h.{l}.mlp.fc_1.weight"][:config.intermediate_size]

Is this somehow possible with the lit-gpt SaveProxyTensor?

@carmocca
Contributor

Hey! All your suggestions make sense to me. You should be able to split the combined ff linear as you suggest, especially if load_param has been called already. We also manipulate the qkv linears for llama2 checkpoints in a similar way.

However, note that your workarounds will only work for inference. During training, wte and ff_out will not be tied and the layernorm parameters won't be frozen.
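
For reference, a rough sketch of what actual tying and freezing could look like on the lit-gpt GPT module (attribute names assumed from lit_gpt/model.py; untested):

import torch

def tie_and_freeze_for_olmo(model: torch.nn.Module) -> None:
    # Sketch only: share the embedding weight with the output head so the tying
    # survives training, and pin the LayerNorms to an identity transform.
    model.lm_head.weight = model.transformer.wte.weight  # same Parameter object, not a copy
    for block in model.transformer.h:
        for norm in (block.norm_1, block.norm_2):
            torch.nn.init.ones_(norm.weight)
            torch.nn.init.zeros_(norm.bias)
            norm.weight.requires_grad = False
            norm.bias.requires_grad = False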

@Andrei-Aksionov
Collaborator

Hello @rasbt

Looks like you are correct. I just wanted to add a couple of things that I noticed while reviewing their code. For posterity, so to speak.

I also don't like it when layers are not initialized in the order they are executed. Lit-GPT also does this: first we create lm_head and only then the transformer layers 🙃.

So the order of execution is as follows:

# Attention
1. attn_norm
2. attn_proj (2048, 6144) <- combined QKV
3. rotary_emb
4. attn_out (2048, 2048)
5. dropout
# MLP
1. ff_norm
2. ff_proj (2048, 16384) <- combined [fc_2, fc_1] / [up, gate] in Llama notation
3. act
4. ff_out (8192, 2048)
5. dropout
  1. Yes, they use weight_tying. It's configurable and they decided to use it. And yes, it won't work during training. Although it wouldn't be difficult to add if more models use it.
  2. Their LayerNorm class supports weight and bias parameters, but this is controlled by the config, and it looks like they turned off .weight and .bias there.
  3. This deserves a bit more explanation.
    In the LLaMAMLP class we have fc_1, fc_2 and proj. During the forward pass we apply fc_1 and fc_2 to the input separately:
    https://github.com/Lightning-AI/lit-gpt/blob/f5d68065ff621fc2cc190c05dcc4ab2cda1d1f57/lit_gpt/model.py#L286-L290

OLMo has only two layers: ff_proj and ff_out. They took an approach similar to a combined QKV matrix and created an ff_proj layer that does this matmul in one go. But the place where they split the result is, I would say, unexpected: in the activation function:

def forward(self, x: torch.Tensor) -> torch.Tensor:
    x, gate = x.chunk(2, dim=-1)
    return F.silu(gate) * x

and then they apply ff_out to it.

It's important to note how they split and then apply the activation function to one chunk. That means:
ff_proj == [fc_2, fc_1]
ff_out == proj
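
A tiny numerical check of this mapping with dummy sizes (D = n_embd, I = intermediate size); just a sketch to confirm how the halves line up:

import torch
import torch.nn.functional as F

D, I = 8, 32
x = torch.randn(2, D)
w_ff_proj = torch.randn(2 * I, D)  # stands in for OLMo's ff_proj.weight

# OLMo: one matmul, then SwiGLU chunks the result into [x_part, gate]
h = x @ w_ff_proj.T
x_part, gate = h.chunk(2, dim=-1)
olmo_out = F.silu(gate) * x_part

# lit-gpt LLaMAMLP layout: silu(fc_1(x)) * fc_2(x)
w_fc_2 = w_ff_proj[:I]  # first half  -> fc_2 (the non-gated branch)
w_fc_1 = w_ff_proj[I:]  # second half -> fc_1 (the gate passed through silu)
litgpt_out = F.silu(x @ w_fc_1.T) * (x @ w_fc_2.T)

assert torch.allclose(olmo_out, litgpt_out)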

@rasbt
Collaborator Author

rasbt commented Feb 14, 2024

Thanks so much for the feedback @carmocca and @Andrei-Aksionov, this was super helpful! After more tinkering, I went with a custom OLMoMLP (analogous to LLaMAMLP) because I thought this was easier than the other workarounds -- both from an implementation perspective and for code readability in the future.

The weights load ok now, but for some reason, the results are garbage. E.g., for

python generate/base.py --checkpoint_dir ./checkpoints/allenai/OLMo-1b/

What food do llamas eat?lerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslers

And

python generate/base.py --checkpoint_dir ./checkpoints/allenai/OLMo-7b/

What food do llamas eat? nic except ' Up Area has , climate * new area even county bun dressingall Bul Index millions Di withdrawal intent except / bun ID tonnes approve welcome St/ regimes health ng est African worse Multiple; p; ques Up ( IL'' Area / p

@rasbt
Collaborator Author

rasbt commented Feb 14, 2024

Yes, they use weight_tying. It's configurable and they decided to use it. And yes, it won't work during training. Although it's not difficult to add if more models will use it.

Actually, upon further inspection, they only use weight tying for the 1B model (https://huggingface.co/allenai/OLMo-1B/blob/main/config.json#L42), not for the 7B model (https://huggingface.co/allenai/OLMo-7B/blob/main/config.json#L42). I adjusted the code accordingly. Still not working well though.

@carmocca
Contributor

I would strongly prefer that we don't add this new MLP class.

To debug the output, you'll have to inspect the activations of both models layer by layer to see where they diverge.

@rasbt
Collaborator Author

rasbt commented Feb 14, 2024

I would strongly prefer that we don't add this new MLP class.

OK! Maybe let's leave it in there until we get it to work, and then we can refactor it into one of the existing classes somehow.

@rasbt
Collaborator Author

rasbt commented Feb 14, 2024

Just to add a note about pinpointing the difference: with Carlos's help, we found that the current difference is in how the QKV matrix is split into queries, keys, and values.

https://github.com/Lightning-AI/lit-gpt/blob/main/lit_gpt/model.py#L195-L202

and

https://github.com/allenai/OLMo/blob/main/olmo/model.py#L687
https://github.com/allenai/OLMo/blob/main/olmo/model.py#L559-L571

In Lit-GPT, Q, K, and V are interleaved (to also support MQA), whereas in OLMo they are not.

We could potentially accommodate OLMo in Lit-GPT if we apply these Llama steps from the conversion script, but in reverse: https://github.com/Lightning-AI/lit-gpt/blob/main/scripts/convert_hf_checkpoint.py#L182-L186
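
For the record, here's a rough sketch of that reverse step, assuming OLMo's fused att_proj is laid out as contiguous [Q; K; V] and that lit-gpt's attn.attn expects per-head interleaving [q0, k0, v0, q1, k1, v1, ...] for MHA (names illustrative, untested):

import torch

def reassemble_qkv(att_proj_weight: torch.Tensor, n_head: int, head_size: int) -> torch.Tensor:
    # Split the fused [3 * n_embd, n_embd] weight into contiguous Q, K, V blocks,
    # then re-interleave them per head the way lit-gpt's CausalSelfAttention expects.
    n_embd = n_head * head_size
    q, k, v = att_proj_weight.split(n_embd, dim=0)
    qs = q.split(head_size, dim=0)
    ks = k.split(head_size, dim=0)
    vs = v.split(head_size, dim=0)
    interleaved = [block for per_head in zip(qs, ks, vs) for block in per_head]
    return torch.cat(interleaved, dim=0)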

@jwkirchenbauer

Hi there!
What is the status on olmo support?
I see that this PR relates to #1013 regarding the QKV implementation.

@Andrei-Aksionov
Collaborator

Hello @jwkirchenbauer
I think #1341 is architecturally similar to OLMo, so after it is merged, it should be much easier to implement OLMo.
Sometime next week, maybe.
It depends on when that PR is merged; plus, I guess Gemma 2 is the priority right now.
But I'll do my best to implement it as soon as possible since there is still interest 😉.

#1013 is a breaking change, so it might take time for reviewers to do their part.

@rasbt
Collaborator Author

rasbt commented Jul 1, 2024

If this PR gets revived at some point, we should check out the qkv_reassemble function from #1341.

@rasbt rasbt closed this Jul 3, 2024
@Andrei-Aksionov Andrei-Aksionov changed the title [WIP] Add OLMo Add OLMo: 1B & 7B Jul 28, 2024
@Andrei-Aksionov
Collaborator

A tale of a PR that rose from the ashes. 🙂

I tested both models; they generate normal-looking text, though they cannot answer a question.
As I understand it, these aren't finetuned models, so that's most likely why.

@rasbt I didn't try to finetune it, but it looks like there are no significant architectural changes, so it should be ok.

cc @jwkirchenbauer Finally made time.

@rasbt
Collaborator Author

rasbt commented Jul 29, 2024

Wow thanks for resurrecting it and pushing it forward!

@Andrei-Aksionov
Collaborator

It looks like the instruct variant expects a prompt in a specific format:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-Instruct-hf")

message = [{"role": "user", "content": "{prompt}"}]
display(tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=True))
"<|endoftext|><|user|>\n{prompt}\n<|assistant|>\n"

@aflah02
Contributor

aflah02 commented Nov 5, 2024

Hey
Is this PR still being worked on? I'm interested in contributing if no one else is working on it, but from what I understand it's pretty much finished, with just some testing left, right? (I might be wrong though.)
Thanks!

@Andrei-Aksionov
Collaborator

Hello @aflah02
It's been a while since I worked on it, but, from what I remember, it should be finished (but not quite 🙃).

If you want to contribute (which is awesome), then you need to:

  1. Resolve conflicts 🫠
  2. Check the instruct version, add a template if needed, run a generation and post the full output (starting from the command) here

@aflah02
Contributor

aflah02 commented Nov 5, 2024

Thanks for the pointers @Andrei-Aksionov
Since I can't push to the same branch and it would be difficult to resolve all conflicts directly, I'm thinking of creating a new PR that copies over the current set of changes from this PR and builds on top of that. Does that work?
Unfortunately, that would lead to loss of attribution of the commits to the people who've worked on this PR before.

@Andrei-Aksionov
Collaborator

Fine by me.

Unfortunately, that would lead to loss of attribution of the commits to the people who've worked on this PR before.

Sebastian and I worked on it.
I personally don't mind. Getting it merged at last is more valuable to me.

What do you think, @rasbt ?

@rasbt
Collaborator Author

rasbt commented Nov 5, 2024

Thanks for your interest in pushing this PR to the finish line, @aflah02. Unfortunately, I recently got swamped with other work responsibilities, so I currently can't spend as much time on this as I wanted to, and I would definitely appreciate the contribution!

Unfortunately, that would lead to loss of attribution of the commits to the people who've worked on this PR before.

I personally also don't mind the lack of attribution for me here. Thanks for doing this tedious task of copying over the branch!

@aflah02
Contributor

aflah02 commented Nov 5, 2024

Thanks @Andrei-Aksionov @rasbt
I'll work on a PR over the next week and share the link here once I'm ready

@aflah02
Contributor

aflah02 commented Nov 7, 2024

It looks like the instruct variant expects a prompt in a specific format:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-Instruct-hf")

message = [{"role": "user", "content": "{prompt}"}]
display(tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=True))
"<|endoftext|><|user|>\n{prompt}\n<|assistant|>\n"

Hi @Andrei-Aksionov

I was looking at the chat template in the OLMo HF tokenizer config (https://huggingface.co/allenai/OLMo-7B-Instruct/blob/main/tokenizer_config.json#L242), and it looks much more complex than the one you've shared above, partly due to the for loops etc., which I guess are used to apply the template to every message in the conversation. I'm not super familiar with how LitGPT adds these prompts, but looking at prompts.py, it seems that for some models the implementation just applies the template:

class Llama2(PromptStyle):
    def apply(self, prompt: str, **kwargs: str) -> str:
        b_inst, e_inst = "[INST]", "[/INST]"
        b_sys, e_sys = "<<SYS>>\n", "\n<</SYS>>\n\n"
        return (
            f"{b_inst} {b_sys}You are a helpful, respectful and honest assistant. Always answer as helpfully as"
            " possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist,"
            " toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and"
            " positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why"
            " instead of answering something not correct. If you don't know the answer to a question, please don't"
            f" share false information.{e_sys} {prompt} {e_inst} "
        )

While for others, multiple cases are handled explicitly, e.g., whether the prompt is a str vs. a list and what the stop_tokens are:

class Llama3(PromptStyle):
    def apply(self, prompt: Union[str, List[Dict[str, str]]], **kwargs: str) -> str:

        default_system_prompt = "You are a helpful assistant."

        # https://github.com/meta-llama/llama3/blob/359887376f0aaf30e433f23e25df858d8c2a9833/llama/tokenizer.py#L202-L229
        if isinstance(prompt, str):
            return (
                "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
                f"{default_system_prompt}<|eot_id|>" # No newline
                "<|start_header_id|>user<|end_header_id|>\n\n"
                f"{prompt}<|eot_id|>" # No newline
                "<|start_header_id|>assistant<|end_header_id|>\n\n"
            )
        elif isinstance(prompt, list):

            def encode_header(role: str) -> List[str]:
                return [f"<|start_header_id|>{role}<|end_header_id|>\n\n"]

            def encode_message(message: Dict[str, str]) -> List[str]:
                tokens = encode_header(message["role"])
                # NOTE: Meta stripped this. I'm not sure I agree, but who am I to argue?
                tokens.append(message["content"].strip())
                tokens.append("<|eot_id|>")
                return tokens

            def has_system_prompt(messages: List[Dict[str, str]]) -> bool:
                return messages[0].get("role", "") == "system" if len(messages) else False

            tokens = ["<|begin_of_text|>"]
            if not has_system_prompt(prompt):
                tokens.extend(encode_message({"role": "system", "content": default_system_prompt}))
            for i, message in enumerate(prompt):
                if i != 0 and message["role"] == "system":
                    raise ValueError("'system' role is only allowed at the beginning of the conversation list.")
                if not message["role"] in ["assistant", "user", "system"]:
                    raise ValueError(f"Unknown role: '{message['role']}'. Supported roles are 'assistant', 'user', and 'system'.")
                tokens.extend(encode_message(message))
            tokens.extend(encode_header("assistant"))
            return "".join(tokens)
        else:
            raise ValueError(f"Unsupported prompt type: {type(prompt)}")

    def stop_tokens(self, tokenizer: "Tokenizer") -> Tuple[List[int], ...]:
        return (
            [tokenizer.eos_id],
            [tokenizer.token_to_id("<|eot_id|>")],
        )

How does one figure out which route is needed for their specific model?

@aflah02
Contributor

aflah02 commented Nov 7, 2024

Also, as I was copying over the changes, I was referring to the adding-models guide (https://github.com/Lightning-AI/litgpt/blob/main/tutorials/developer-docs/adding-models.md), and I noticed that even without copying over the checkpoint conversion files, it just worked out of the box. Is this expected? I thought it wouldn't, given that there was custom logic written to convert checkpoints from HF to lit and vice versa.

@aflah02
Contributor

aflah02 commented Nov 7, 2024

Here are the outputs from the 7B Instruct model. These look good to me.

(screenshots of the generated outputs)

@aflah02 aflah02 mentioned this pull request Nov 7, 2024
@aflah02
Contributor

aflah02 commented Nov 7, 2024

I also raised a PR. Not all changes are done yet, but some of the changes in this PR don't seem to be needed: https://github.com/Lightning-AI/litgpt/pull/1827/files

@Andrei-Aksionov
Collaborator

Hello @aflah02

I believe only the Llama 3 prompt style supports multi-turn generation.
I think that was done for better integration with LitServe.
The others only do regular, single-turn generation.
We have an open issue to support multi-turn for other models: #1488

I think for now it's fine to add the basic template and then extend it in later PRs.
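
Something like this minimal prompt style should cover the basic case (a sketch; the class name and import location are assumptions, modeled on the styles quoted from prompts.py above):

from typing import List, Tuple

from litgpt.prompts import PromptStyle  # assumed location, matching the quoted prompts.py examples


class OLMoInstruct(PromptStyle):
    # Basic single-turn template taken from the tokenizer's chat template
    def apply(self, prompt: str, **kwargs: str) -> str:
        return f"<|endoftext|><|user|>\n{prompt}\n<|assistant|>\n"

    def stop_tokens(self, tokenizer: "Tokenizer") -> Tuple[List[int], ...]:
        return ([tokenizer.eos_id],)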


Also, as I was copying over the changes, I was referring to the adding-models guide (https://github.com/Lightning-AI/litgpt/blob/main/tutorials/developer-docs/adding-models.md), and I noticed that even without copying over the checkpoint conversion files, it just worked out of the box. Is this expected? I thought it wouldn't, given that there was custom logic written to convert checkpoints from HF to lit and vice versa.

I didn't notice that, but the script that I wrote for converting weights is actually a special case of copy_weights_hf_llama 🙂.

@aflah02
Contributor

aflah02 commented Nov 7, 2024

Thanks for the confirmation!

@Andrei-Aksionov
Collaborator

Closing, as this was implemented in #1827.
