
Request: NTK rope support #479

Closed

lucasjinreal opened this issue Jul 17, 2023 · 10 comments

Comments

@lucasjinreal

lucasjinreal commented Jul 17, 2023

Hi, there are some very successful experiments showing that NTK-based RoPE can achieve good extrapolation ability without any finetuning.

I have tested it as well and it works well: a model trained with a 1024-token context can reach very impressive long-context ability with NTK RoPE.

Would you consider supporting it, since it (probably) doesn't require many changes?

However, the positional-encoding op is baked into the CUDA kernel.

Currently I can use torch code to check whether the context length is bigger than 2048 and then apply NTK, but wouldn't it be better if vLLM supported it out of the box?
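For context, the trick is "dynamic NTK" scaling: rather than interpolating positions, the RoPE base is enlarged as the sequence grows, so the rotary frequencies span a longer context. A minimal standalone sketch of the idea (function and variable names here are illustrative, not vLLM's API):

import torch

def ntk_scaled_inv_freq(dim: int, base: float, seq_len: int, trained_len: int) -> torch.Tensor:
    """Inverse RoPE frequencies, with the base enlarged when seq_len exceeds trained_len."""
    if seq_len > trained_len:
        alpha = seq_len / trained_len - 1          # same dynamic-alpha rule used in the patch later in this thread
        base = base * alpha ** (dim / (dim - 2))   # NTK-aware base scaling
    return 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))

# Build a cos/sin table for an 8k context from a model trained on 1k tokens.
inv_freq = ntk_scaled_inv_freq(dim=128, base=10000.0, seq_len=8192, trained_len=1024)
t = torch.arange(8192, dtype=torch.float32)
freqs = torch.einsum("i,j->ij", t, inv_freq)       # [seq_len, dim // 2]
cos, sin = freqs.cos(), freqs.sin()                # fed to the rotary embedding as usual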

@lucasjinreal
Author

I have looked a little deeper and found that this implementation is actually simple; there is no need to edit any .cu files.

I have drafted a version to support NTK; let's see if it works.

@lucasjinreal
Author

I have tested NTK support in vLLM and it works; extrapolation goes up to 8k without any finetuning.

@lucasjinreal
Author

Here is the main modification:

def forward(
        self,
        positions: torch.Tensor,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        key_cache: torch.Tensor,
        value_cache: torch.Tensor,
        input_metadata: InputMetadata,
        cache_event: Optional[torch.cuda.Event],
        seq_len: int,
    ) -> torch.Tensor:
        """ PagedAttention forward pass with rotary embedding.

        Args:
            positions: shape = [num_tokens]
                        query: shape = [num_tokens, num_heads * head_size]
            key: shape = [num_tokens, num_heads * head_size]
            value: shape = [num_tokens, num_heads * head_size]
            key_cache: shape = [num_blocks, num_heads, head_size/x,
                block_size, x]
            value_cache: shape = [num_blocks, num_heads, head_size, block_size]
            input_metadata: metadata for paged attention.
            cache_event: event to wait for the cache operations to finish.

        Returns:
            shape = [num_tokens, num_heads * head_size]
        """

        # Apply rotary embedding to the query and key before passing them
        # to the attention op.
        if seq_len > self.max_seq_len_cached:
            # Dynamic NTK: rebuild the cos/sin cache with an enlarged base once
            # the sequence exceeds the cached maximum length.
            t = torch.arange(seq_len, device=positions.device, dtype=self.inv_freq.dtype)
            dim = self.dim
            alpha = seq_len / 1024 - 1  # 1024 = the model's trained context length (hard-coded here)
            base = self.base * alpha ** (dim / (dim - 2))
            inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float().to(positions.device) / dim))

            freqs = torch.einsum("i,j->ij", t, inv_freq)
            emb = torch.cat((freqs, freqs), dim=-1).to(positions.device)
            cos = emb.cos()
            sin = emb.sin()
            cache = torch.cat((cos, sin), dim=-1).to(self.inv_freq.dtype)
            pos_encoding_ops.rotary_embedding_neox(
                positions,
                query,
                key,
                self.head_size,
                cache,
            )
        else:
            pos_encoding_ops.rotary_embedding_neox(
                positions,
                query,
                key,
                self.head_size,
                self.cos_sin_cache,
            )
        return super().forward(
            query,
            key,
            value,
            key_cache,
            value_cache,
            input_metadata,
            cache_event,
        )
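For intuition, a rough worked example of what the NTK branch above computes (dim = 128 and base = 10000 are assumptions; numbers are approximate):

dim, base, trained_len, seq_len = 128, 10000.0, 1024, 8192
alpha = seq_len / trained_len - 1                # 7.0
scaled_base = base * alpha ** (dim / (dim - 2))  # ~7.2e4, i.e. the base grows roughly 7x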

@81549361

Great, can you please tell me how to use it?

@abarcovschi

Do you know if this can be extended to a 16k context size? If so could you please provide the code necessary for this? @lucasjinreal

@ShadowTeamCN

ShadowTeamCN commented Aug 1, 2023

(Quoting the comment and code above.)

Does seq_len in this forward function equal key.size(0) + key_cache.size(0)?

@lucasjinreal
Author

@ShadowTeamCN I'm not sure; it should be the same as the sequence length on the torch side.

@PaynatPierre

(Quoting the code above.)

In which file do you make this change, exactly?

@EricLingRui

I passed in two samples in a batch, with lengths of 6 and 8 respectively.
When I print the positions value, it looks like:
[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 6, 7]
If so, I guess pos_encoding_ops would need to slice the cos_sin_cache values internally for each sequence separately.
So I feel it is difficult to implement NTK-aware scaling without changing the CUDA ops to adapt to batched inference.
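To illustrate the point with a standalone sketch (plain PyTorch, not vLLM code):

import torch

# In a batched prefill, positions restart at 0 for each sequence, so a single
# cos_sin_cache (and a single NTK alpha derived from one seq_len) is shared by
# sequences of different lengths.
seq_lens = [6, 8]
positions = torch.cat([torch.arange(n) for n in seq_lens])
print(positions.tolist())   # [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 6, 7]

# The kernel looks up cache rows by these position values, so per-sequence NTK
# scaling would need per-sequence caches or a kernel change, as noted above.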

@hmellor
Collaborator

hmellor commented Mar 8, 2024

Closing as RoPE is now supported. If this is incorrect, feel free to re-open this issue.
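(For later readers: a hedged example of how this is usually enabled without patching vLLM. It assumes vLLM picks up the HF-style rope_scaling entry from the model's config.json; the field layout follows the HF LLaMA convention of the time, so verify against the docs for your versions.)

import json

# Hedged example: enable dynamic-NTK RoPE scaling via the HF model config.
# "my-llama-model" is a placeholder path; field names follow the HF LLaMA convention.
with open("my-llama-model/config.json") as f:
    cfg = json.load(f)

cfg["rope_scaling"] = {"type": "dynamic", "factor": 2.0}  # dynamic NTK, 2x the trained context

with open("my-llama-model/config.json", "w") as f:
    json.dump(cfg, f, indent=2)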

@hmellor hmellor closed this as completed Mar 8, 2024