Add OLMo: 1B & 7B #927

Closed
wants to merge 21 commits into from
Conversation

rasbt
Collaborator

@rasbt rasbt commented Feb 12, 2024

Adds the popular and fully open-source OLMo models by Allen AI.

  • Implement model download
  • Test tokenizer
  • Implement HF checkpoint conversion
  • Clean up HF checkpoint conversion
  • Make sure to use the right layer normalization
  • Make sure generate.py produces reasonable outputs
  • Update download and finetuning docs
  • Test pretraining
  • Test finetuning
    • Full finetuning
    • LoRA
    • Adapter
  • Add tests
  • Update README

Fixes #925

@rasbt
Collaborator Author

rasbt commented Feb 13, 2024

I'm a bit stuck with the conversion and would appreciate your advice and ideas, @carmocca or @Andrei-Aksionov!

So, here are three special things about OLMo:

  1. They use weight tying like GPT-2: the WTE weight is reused as the output projection weight. The way they saved the tensors on the Hub, though, they simply duplicated that tensor, so no action should be required there. When loading the model in HuggingFace, I checked that olmo.model.transformer.wte.weight and olmo.model.transformer.ff_out.weight contain the same tensor. That should be all good here.

  2. They use a non-parametric LayerNorm, i.e., their LayerNorm doesn't have the scale (weight) and shift (bias) parameters. To avoid code changes just for this model, my workaround is to use ones and zeros so that these parameters have no effect:

        state_dict[f"transformer.h.{l}.norm_1.weight"] = torch.ones(config.n_embd)
        state_dict[f"transformer.h.{l}.norm_2.weight"] = torch.ones(config.n_embd)
        state_dict[f"transformer.h.{l}.norm_1.bias"] = torch.zeros(config.n_embd)
        state_dict[f"transformer.h.{l}.norm_2.bias"] = torch.zeros(config.n_embd)
  3. The problem is that I'm missing weights ...

The HF version looks like this, which is confusing because, as far as I can tell, the ops are not applied in that "sequential" order:

OLMoForCausalLM(
  (model): Olmo(
    (transformer): ModuleDict(
      (wte): Embedding(50304, 2048)
      (emb_drop): Dropout(p=0.0, inplace=False)
      (ln_f): LayerNorm()
      (blocks): ModuleList(
        (0-15): 16 x OlmoSequentialBlock(
          (dropout): Dropout(p=0.0, inplace=False)
          (act): SwiGLU()
          (attn_out): Linear(in_features=2048, out_features=2048, bias=False)
          (ff_out): Linear(in_features=8192, out_features=2048, bias=False)
          (rotary_emb): RotaryEmbedding()
          (attn_norm): LayerNorm()
          (ff_norm): LayerNorm()
          (att_proj): Linear(in_features=2048, out_features=6144, bias=False)
          (ff_proj): Linear(in_features=2048, out_features=16384, bias=False)
        )
      )
      (ff_out): Embedding(50304, 2048)
    )
  )
)

Unless I'm wrong, I think what happens is that ff_proj is a placeholder for the MLP's FC1 and FC2 layers, i.e., the first half is FC1 and the second half is FC2. It's kind of confusing though.

What I'm thinking is that we have to split the fc weights, which would avoid having to write custom code in the GPT model class:

    weight_map = {
        "model.transformer.wte.weight": "transformer.wte.weight",
        "model.transformer.ff_out.weight": "lm_head.weight",
        "model.transformer.blocks.{}.attn_out.weight": "transformer.h.{}.attn.proj.weight",
        "model.transformer.blocks.{}.ff_proj.weight": "transformer.h.{}.mlp.fc_1.weight", # split into fc1 and fc2
        "model.transformer.blocks.{}.att_proj.weight": "transformer.h.{}.attn.attn.weight",
        "model.transformer.blocks.{}.ff_out.weight": "transformer.h.{}.mlp.proj.weight",
    }
...

    for l in range(config.n_layer):
        # ff_proj stacks both MLP projections along dim 0, so split at the intermediate size
        state_dict[f"transformer.h.{l}.mlp.fc_2.weight"] = state_dict[f"transformer.h.{l}.mlp.fc_1.weight"][config.intermediate_size:]
        state_dict[f"transformer.h.{l}.mlp.fc_1.weight"] = state_dict[f"transformer.h.{l}.mlp.fc_1.weight"][:config.intermediate_size]

Is this somehow possible with the lit-gpt SaveProxyTensor?

@carmocca
Contributor

Hey! All your suggestions make sense to me. You should be able to split the combined ff linear as you suggest, especially if load_param has been called already. We also manipulate the qkv linears for llama2 checkpoints in a similar way.

However, note that your workarounds will only work for inference. During training, wte and ff_out will not be tied and the layernorm parameters won't be frozen.
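
For reference, a rough sketch of what actual tying and freezing could look like on the lit-gpt GPT module (attribute names assumed from lit_gpt/model.py; untested):

import torch

def tie_and_freeze_for_olmo(model: torch.nn.Module) -> None:
    # Sketch only: share the embedding weight with the output head so the tying
    # survives training, and pin the LayerNorms to an identity transform.
    model.lm_head.weight = model.transformer.wte.weight  # same Parameter object, not a copy
    for block in model.transformer.h:
        for norm in (block.norm_1, block.norm_2):
            torch.nn.init.ones_(norm.weight)
            torch.nn.init.zeros_(norm.bias)
            norm.weight.requires_grad = False
            norm.bias.requires_grad = False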

@Andrei-Aksionov
Collaborator

Hello @rasbt

Looks like you are correct. I just wanted to add a couple of things that I noticed while reviewing their code. For posterity, so to speak.

I also don't like it when layers are not initialized in the order they are executed. Lit-GPT also does this: first we create lm_head and only then the transformer layers 🙃.

So the order of execution is as follows:

# Attention
1. attn_norm
2. attn_proj (2048, 6144) <- combined QKV
3. rotary_emb
4. attn_out (2048, 2048)
5. dropout
# MLP
1. ff_norm
2. ff_proj (2048, 16384) <- combined [fc_2, fc_1] / [up, gate] in Llama notation
3. act
4. ff_out (8192, 2048)
5. dropout
  1. Yes, they use weight_tying. It's configurable and they decided to use it. And yes, it won't work during training. Although it wouldn't be difficult to add if more models use it.
  2. Their LayerNorm class supports weight and bias parameters, but this is controlled by the config, and it looks like they turned off .weight and .bias there.
  3. This deserves a bit more explanation.
    In the LLaMAMLP class we have fc_1, fc_2 and proj. During the forward pass we apply fc_1 and fc_2 to the input separately:
    https://github.com/Lightning-AI/lit-gpt/blob/f5d68065ff621fc2cc190c05dcc4ab2cda1d1f57/lit_gpt/model.py#L286-L290

OLMo has only two layers: ff_proj and ff_out. They took an approach similar to a combined QKV matrix and created an ff_proj layer that does this matmul in one go. But the place where they split the result is, I would say, unexpected: in the activation function:

def forward(self, x: torch.Tensor) -> torch.Tensor:
    x, gate = x.chunk(2, dim=-1)
    return F.silu(gate) * x

and then they apply ff_out to it.

It's important to note how they split and then apply the activation function to one chunk. That means:
ff_proj == [fc_2, fc_1]
ff_out == proj
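
A tiny numerical check of this mapping with dummy sizes (D = n_embd, I = intermediate size); just a sketch to confirm how the halves line up:

import torch
import torch.nn.functional as F

D, I = 8, 32
x = torch.randn(2, D)
w_ff_proj = torch.randn(2 * I, D)  # stands in for OLMo's ff_proj.weight

# OLMo: one matmul, then SwiGLU chunks the result into [x_part, gate]
h = x @ w_ff_proj.T
x_part, gate = h.chunk(2, dim=-1)
olmo_out = F.silu(gate) * x_part

# lit-gpt LLaMAMLP layout: silu(fc_1(x)) * fc_2(x)
w_fc_2 = w_ff_proj[:I]  # first half  -> fc_2 (the non-gated branch)
w_fc_1 = w_ff_proj[I:]  # second half -> fc_1 (the gate passed through silu)
litgpt_out = F.silu(x @ w_fc_1.T) * (x @ w_fc_2.T)

assert torch.allclose(olmo_out, litgpt_out)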

@rasbt
Collaborator Author

rasbt commented Feb 14, 2024

Thanks so much for the feedback @carmocca and @Andrei-Aksionov, this was super helpful! After more tinkering, I went with a custom OLMoMLP (analogous to LLaMAMLP) because I thought this was easier than the other workarounds -- both from an implementation perspective and for code readability in the future.

The weights load ok now, but for some reason, the results are garbage. E.g., for

python generate/base.py --checkpoint_dir ./checkpoints/allenai/OLMo-1b/

What food do llamas eat?lerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslerslers

And

python generate/base.py --checkpoint_dir ./checkpoints/allenai/OLMo-7b/

What food do llamas eat? nic except ' Up Area has , climate * new area even county bun dressingall Bul Index millions Di withdrawal intent except / bun ID tonnes approve welcome St/ regimes health ng est African worse Multiple; p; ques Up ( IL'' Area / p

@rasbt
Collaborator Author

rasbt commented Feb 14, 2024

Yes, they use weight_tying. It's configurable and they decided to use it. And yes, it won't work during training. Although it's not difficult to add if more models will use it.

Actually, upon further inspection, they only use weight tying for the 1B model (https://huggingface.co/allenai/OLMo-1B/blob/main/config.json#L42), not for the 7B model (https://huggingface.co/allenai/OLMo-7B/blob/main/config.json#L42). I adjusted the code accordingly. Still not working well though.

@carmocca
Contributor

I would strongly prefer that we don't add this new MLP class.

To debug the output, you'll have to inspect the activations of both models layer by layer to see where they diverge.

@rasbt
Collaborator Author

rasbt commented Feb 14, 2024

I would strongly prefer that we don't add this new MLP class.

OK! Maybe let's leave it in there until we get it to work, and then we can refactor it into one of the existing classes somehow.

@rasbt
Collaborator Author

rasbt commented Feb 14, 2024

Just to add a note about pinpointing the difference: with Carlos's help, we found that the current difference is in how the QKV matrix is split into queries, keys, and values.

https://github.com/Lightning-AI/lit-gpt/blob/main/lit_gpt/model.py#L195-L202

and

https://github.com/allenai/OLMo/blob/main/olmo/model.py#L687
https://github.com/allenai/OLMo/blob/main/olmo/model.py#L559-L571

In Lit-GPT, Q, K, and V are interleaved (to also support MQA), whereas in OLMo they are not.

We could potentially accommodate OLMo in Lit-GPT if we apply these Llama steps from the conversion script, but in reverse: https://github.com/Lightning-AI/lit-gpt/blob/main/scripts/convert_hf_checkpoint.py#L182-L186
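
For the record, here's a rough sketch of that reverse step, assuming OLMo's fused att_proj is laid out as contiguous [Q; K; V] and that lit-gpt's attn.attn expects per-head interleaving [q0, k0, v0, q1, k1, v1, ...] for MHA (names illustrative, untested):

import torch

def reassemble_qkv(att_proj_weight: torch.Tensor, n_head: int, head_size: int) -> torch.Tensor:
    # Split the fused [3 * n_embd, n_embd] weight into contiguous Q, K, V blocks,
    # then re-interleave them per head the way lit-gpt's CausalSelfAttention expects.
    n_embd = n_head * head_size
    q, k, v = att_proj_weight.split(n_embd, dim=0)
    qs = q.split(head_size, dim=0)
    ks = k.split(head_size, dim=0)
    vs = v.split(head_size, dim=0)
    interleaved = [block for per_head in zip(qs, ks, vs) for block in per_head]
    return torch.cat(interleaved, dim=0)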

@jwkirchenbauer

Hi there!
What is the status on olmo support?
I see that this PR relates to #1013 regarding the QKV implementation.

@Andrei-Aksionov
Collaborator

Hello @jwkirchenbauer
I think #1341 is architecturally similar to OLMo, so after it is merged, it should be much easier to implement OLMo.
Sometime next week, maybe.
It depends on when that PR is merged; plus, I guess Gemma 2 is the priority right now.
But I'll do my best to implement it as soon as possible since there is still interest 😉.

#1013 is a breaking change, so it might take time for reviewers to do their part.

@rasbt
Collaborator Author

rasbt commented Jul 1, 2024

If this PR gets revived at some point, we should check out the qkv_reassemble function from #1341.

@rasbt rasbt closed this Jul 3, 2024
@Andrei-Aksionov Andrei-Aksionov changed the title [WIP] Add OLMo Add OLMo: 1B & 7B Jul 28, 2024
@Andrei-Aksionov
Collaborator

A tale of a PR that rose from the ashes. 🙂

I tested both models; they generate normal-looking text, though they cannot answer a question.
As I understand it, these aren't finetuned models, so that's most likely why.

@rasbt I didn't try to finetune it, but it looks like there are no significant architectural changes, so it should be ok.

cc @jwkirchenbauer Finally made time.

@rasbt
Collaborator Author

rasbt commented Jul 29, 2024

Wow thanks for resurrecting it and pushing it forward!

@Andrei-Aksionov
Collaborator

It looks like the instruct variant expects a prompt in a specific format:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-Instruct-hf")

message = [{"role": "user", "content": "{prompt}"}]
display(tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=True))
"<|endoftext|><|user|>\n{prompt}\n<|assistant|>\n"

@aflah02
Contributor

aflah02 commented Nov 5, 2024

Hey
Is this PR still being worked on? I'm interested in contributing if no one else is working on it, but from what I understand it's pretty much finished, with just some testing left, right? (I might be wrong though.)
Thanks!

@Andrei-Aksionov
Collaborator

Hello @aflah02
It's been a while since I worked on it, but, from what I remember, it should be finished (but not quite 🙃).

If you want to contribute (which is awesome), then you need to:

  1. Resolve conflicts 🫠
  2. Check the instruct version, add a template if needed, run a generation and post the full output (starting from the command) here

@aflah02
Contributor

aflah02 commented Nov 5, 2024

Thanks for the pointers @Andrei-Aksionov
Since I can't push to the same branch and it would be difficult to resolve all conflicts directly, I'm thinking of creating a new PR that copies over the current set of changes from this PR and builds on top of that. Does that work?
Unfortunately, that would lead to loss of attribution of the commits to the people who've worked on this PR before.

@Andrei-Aksionov
Collaborator

Fine by me.

Unfortunately, that would lead to loss of attribution of the commits to the people who've worked on this PR before.

Sebastian and I worked on it.
I personally don't mind. Getting it merged at last is more valuable to me.

What do you think, @rasbt ?

@rasbt
Collaborator Author

rasbt commented Nov 5, 2024

Thanks for your interest in pushing this PR to the finish line, @aflah02. Unfortunately, I recently got swamped with other work responsibilities, so I currently can't spend as much time on this as I wanted to, and I would definitely appreciate the contribution!

Unfortunately, that would lead to loss of attribution of the commits to the people who've worked on this PR before.

I personally also don't mind the lack of attribution for me here. Thanks for doing this tedious task of copying over the branch!

@aflah02
Contributor

aflah02 commented Nov 5, 2024

Thanks @Andrei-Aksionov @rasbt
I'll work on a PR over the next week and share the link here once I'm ready

@aflah02
Contributor

aflah02 commented Nov 7, 2024

It looks like the instruct variant expects a prompt in a specific format:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-Instruct-hf")

message = [{"role": "user", "content": "{prompt}"}]
display(tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=True))
"<|endoftext|><|user|>\n{prompt}\n<|assistant|>\n"

Hi @Andrei-Aksionov

I was looking at the chat template in the OLMo HF tokenizer config (https://huggingface.co/allenai/OLMo-7B-Instruct/blob/main/tokenizer_config.json#L242), and it looks much more complex than the one you've shared above, partly due to the for loops etc., which I guess are used to apply the template to every message in the conversation. I'm not super familiar with how LitGPT adds these prompts, but looking at prompts.py, it seems that for some models the implementation just applies the template:

class Llama2(PromptStyle):
    def apply(self, prompt: str, **kwargs: str) -> str:
        b_inst, e_inst = "[INST]", "[/INST]"
        b_sys, e_sys = "<<SYS>>\n", "\n<</SYS>>\n\n"
        return (
            f"{b_inst} {b_sys}You are a helpful, respectful and honest assistant. Always answer as helpfully as"
            " possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist,"
            " toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and"
            " positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why"
            " instead of answering something not correct. If you don't know the answer to a question, please don't"
            f" share false information.{e_sys} {prompt} {e_inst} "
        )

While for others, multiple cases are handled explicitly, e.g., whether the prompt is a str vs. a list and what the stop_tokens are:

class Llama3(PromptStyle):
    def apply(self, prompt: Union[str, List[Dict[str, str]]], **kwargs: str) -> str:

        default_system_prompt = "You are a helpful assistant."

        # https://github.com/meta-llama/llama3/blob/359887376f0aaf30e433f23e25df858d8c2a9833/llama/tokenizer.py#L202-L229
        if isinstance(prompt, str):
            return (
                "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
                f"{default_system_prompt}<|eot_id|>" # No newline
                "<|start_header_id|>user<|end_header_id|>\n\n"
                f"{prompt}<|eot_id|>" # No newline
                "<|start_header_id|>assistant<|end_header_id|>\n\n"
            )
        elif isinstance(prompt, list):

            def encode_header(role: str) -> List[str]:
                return [f"<|start_header_id|>{role}<|end_header_id|>\n\n"]

            def encode_message(message: Dict[str, str]) -> List[str]:
                tokens = encode_header(message["role"])
                # NOTE: Meta stripped this. I'm not sure I agree, but who am I to argue?
                tokens.append(message["content"].strip())
                tokens.append("<|eot_id|>")
                return tokens

            def has_system_prompt(messages: List[Dict[str, str]]) -> bool:
                return messages[0].get("role", "") == "system" if len(messages) else False

            tokens = ["<|begin_of_text|>"]
            if not has_system_prompt(prompt):
                tokens.extend(encode_message({"role": "system", "content": default_system_prompt}))
            for i, message in enumerate(prompt):
                if i != 0 and message["role"] == "system":
                    raise ValueError("'system' role is only allowed at the beginning of the conversation list.")
                if not message["role"] in ["assistant", "user", "system"]:
                    raise ValueError(f"Unknown role: '{message['role']}'. Supported roles are 'assistant', 'user', and 'system'.")
                tokens.extend(encode_message(message))
            tokens.extend(encode_header("assistant"))
            return "".join(tokens)
        else:
            raise ValueError(f"Unsupported prompt type: {type(prompt)}")

    def stop_tokens(self, tokenizer: "Tokenizer") -> Tuple[List[int], ...]:
        return (
            [tokenizer.eos_id],
            [tokenizer.token_to_id("<|eot_id|>")],
        )

How does one figure out which route is needed for their specific model?

@aflah02
Contributor

aflah02 commented Nov 7, 2024

Also, as I was copying over the changes, I was referring to the adding-models guide (https://github.com/Lightning-AI/litgpt/blob/main/tutorials/developer-docs/adding-models.md), and I noticed that even without copying over the checkpoint conversion files, it just worked out of the box. Is this expected? I thought it wouldn't, given that there was custom logic written to convert checkpoints from HF to lit and vice versa.

@aflah02
Contributor

aflah02 commented Nov 7, 2024

Here are the outputs from the 7B Instruct model. These look good to me.

(screenshots of the generated outputs)

@aflah02 aflah02 mentioned this pull request Nov 7, 2024
@aflah02
Contributor

aflah02 commented Nov 7, 2024

I also raised a PR. Not all changes are done yet, but some of the changes in this PR don't seem to be needed: https://github.com/Lightning-AI/litgpt/pull/1827/files

@Andrei-Aksionov
Collaborator

Hello @aflah02

I believe only the Llama 3 prompt style supports multi-turn generation.
I think that was done for better integration with LitServe.
The others only do regular, single-turn generation.
We have an open issue to support multi-turn for other models: #1488

I think for now it's fine to add the basic template and then extend it in later PRs.
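
Something like this minimal prompt style should cover the basic case (a sketch; the class name and import location are assumptions, modeled on the styles quoted from prompts.py above):

from typing import List, Tuple

from litgpt.prompts import PromptStyle  # assumed location, matching the quoted prompts.py examples


class OLMoInstruct(PromptStyle):
    # Basic single-turn template taken from the tokenizer's chat template
    def apply(self, prompt: str, **kwargs: str) -> str:
        return f"<|endoftext|><|user|>\n{prompt}\n<|assistant|>\n"

    def stop_tokens(self, tokenizer: "Tokenizer") -> Tuple[List[int], ...]:
        return ([tokenizer.eos_id],)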


Also, as I was copying over the changes, I was referring to the adding-models guide (https://github.com/Lightning-AI/litgpt/blob/main/tutorials/developer-docs/adding-models.md), and I noticed that even without copying over the checkpoint conversion files, it just worked out of the box. Is this expected? I thought it wouldn't, given that there was custom logic written to convert checkpoints from HF to lit and vice versa.

I didn't notice that, but the script that I wrote for converting weights is actually a special case of copy_weights_hf_llama 🙂.

@aflah02
Contributor

aflah02 commented Nov 7, 2024

Thanks for the confirmation!

@Andrei-Aksionov
Collaborator

Closing, as this was implemented in #1827.
