Comparison with tract #50
Comments
Is this using the same […]? Assuming this is the case, then using the […]

For the first two shape combinations that take up most of the time, the ordering of the first two dimensions turns the operation from batched matrix x matrix multiplication into batched vector x matrix multiplication, and that's not handled very efficiently currently. Looking at the model inputs, I see that the dim order is […]. There are definitely optimizations possible in RTen here. In the interim, using `batch_first=True` may help.
Batched matrix multiplication is handled by prepacking one or neither of the inputs, depending on how often each is re-used, and then performing one `gemm` call per matrix in the output shape. This can be inefficient if the matrices in the batch end up being small in one or both dimensions, for example if one of the matrices is a vector. In that case it can be better to reshape the inputs so that instead of many low-arithmetic intensity `gemm` calls, a single higher-arithmetic intensity call is performed. The output is then reshaped to restore the batch dimensions. See #50
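To illustrate the reshape described above: the sketch below, in plain Rust with naive row-major buffers (not RTen's actual `gemm` kernel), shows that a batch of [1, K] x [K, N] vector-matrix products computes the same values as a single [batch, K] x [K, N] matmul, so collapsing the batch into one higher-arithmetic-intensity call is valid.

```rust
// A sketch using naive row-major `Vec<f32>` buffers: a batch of
// [1, K] x [K, N] vector-matrix products equals one [batch, K] x [K, N]
// matmul, which gives the kernel more work per call.

/// Naive row-major matmul: C[M, N] = A[M, K] * B[K, N].
fn matmul(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0.0; m * n];
    for i in 0..m {
        for p in 0..k {
            let a_ip = a[i * k + p];
            for j in 0..n {
                c[i * n + j] += a_ip * b[p * n + j];
            }
        }
    }
    c
}

/// Batched vector x matrix: one low-intensity call per item in the batch.
fn batched_vec_mat(a: &[f32], b: &[f32], batch: usize, k: usize, n: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(batch * n);
    for i in 0..batch {
        out.extend(matmul(&a[i * k..(i + 1) * k], b, 1, k, n));
    }
    out
}

fn main() {
    let (batch, k, n) = (4, 3, 2);
    let a: Vec<f32> = (0..batch * k).map(|v| v as f32).collect();
    let b: Vec<f32> = (0..k * n).map(|v| v as f32).collect();

    // Many small calls over [1, k] slices vs. one call over the reshaped
    // [batch, k] view of the same data: identical results.
    let per_item = batched_vec_mat(&a, &b, batch, k, n);
    let reshaped = matmul(&a, &b, batch, k, n);
    assert_eq!(per_item, reshaped);
}
```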
Thank you for the answer! I will try to compare with `batch_first=True`.
I tried setting `batch_first=True` and it seems that the speed is the same (I also changed the order of batch_size and sequence_length).
Internally it seems the matrix multiplications are still being done in the same order and devolving to vector x matrix. The draft in #51 should improve this once it lands.
I'll look forward to it, thank you!
Batched matrix multiplication was handled by prepacking one or neither of the inputs, depending on how often each is re-used, and then performing one `gemm` call per matrix in the output shape. This can be inefficient if one of the matrices passed to a `gemm` call ends up being small in one or both dimensions. For example in [1], the LHS / "A" input is a vector. In the case where the "A" input is a batch and the "B" input is a single matrix, the "A" input can be reshaped so a single `gemm` call can be used, with the output reshaped afterwards to restore the batch dimensions. In addition to this strategy, add a simple benchmark for different input shapes. [1] #50
Batched matrix multiplication was handled by prepacking one or neither of the inputs, depending on how often each is re-used, and then performing one `gemm` call per matrix in the output shape. This can be inefficient when the LHS input has a small number of rows. For example in [1], the LHS / "A" input is a row vector. In the case where the "A" input is a batch and the "B" input is a single matrix, the "A" input can be reshaped so a single `gemm` call can be used, with the output reshaped afterwards to restore the batch dimensions. Implement this alternate approach and add a simple benchmark for batched matmul. [1] #50
#51 has been merged, which should improve performance in this case.
Thank you so much! Yeah, it works faster now, but it seems that tract is still faster. On my laptop the difference is literally 5-10 milliseconds when encoding/decoding 1 token, but since each token is generated separately in a loop during decoding, the difference accumulates and, depending on the length of the sequence, can grow significantly. Encoding with rten is already faster for >= 10 tokens, though. In general I use the Transformer model to translate texts, and at the moment tract works about 1.5 times faster on average. But maybe that's because of my implementation, as I split the text into sentences and translate the sentences in parallel. I'll try to look deeper into what might be causing the difference. In the example above you can see the roughly 1.5x difference if `TOKEN_SIZE = 1` is specified (I updated the example).
Are you running the model with a single sentence at a time (i.e. batch size = 1) or with a batch of sentences? If you have many sentences, what I would recommend doing is grouping the sentences into batches of approximately equal length, with appropriate padding and masks, and calling the model once per batch. Each call to run the model has some overhead, and passing in a batch of inputs amortizes this overhead over the batch.
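A rough sketch of this kind of length-based bucketing (no RTen APIs; the `make_batches` helper and its limits are made up, and the right values depend on the model and hardware):

```rust
/// Group token sequences into batches where the longest and shortest
/// sequences differ by at most `max_len_spread` tokens and each batch
/// holds at most `max_batch_size` sequences.
fn make_batches(
    mut sentences: Vec<Vec<u32>>,
    max_batch_size: usize,
    max_len_spread: usize,
) -> Vec<Vec<Vec<u32>>> {
    // Sort by length so that neighbouring sentences need little padding.
    sentences.sort_by_key(|s| s.len());

    let mut batches: Vec<Vec<Vec<u32>>> = Vec::new();
    for sent in sentences {
        let start_new_batch = match batches.last() {
            Some(batch) => {
                batch.len() >= max_batch_size
                    || sent.len() - batch[0].len() > max_len_spread
            }
            None => true,
        };
        if start_new_batch {
            batches.push(vec![sent]);
        } else {
            batches.last_mut().unwrap().push(sent);
        }
    }
    batches
}

fn main() {
    let sentences: Vec<Vec<u32>> =
        vec![vec![1; 5], vec![1; 7], vec![1; 6], vec![1; 20], vec![1; 21]];
    for batch in make_batches(sentences, 8, 4) {
        // Each batch is padded to its own maximum length and run in one call.
        let pad_to = batch.iter().map(|s| s.len()).max().unwrap();
        println!("batch of {} sentences, padded to {} tokens", batch.len(), pad_to);
    }
}
```

Each batch would then be padded to its own maximum length, with an attention mask marking the padded positions, and passed to the model in a single call.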
Yes, but if for example there are only 2 sentences of different lengths, it seems that time will be wasted decoding the padding of one of the sentences. I think it would be used the same way in Python if Python had the same ability to parallelize code as Rust. I checked the inference of the decoder and it also runs slower than with tract when batch_size = 1. I can attach the decoder weights and an inference example if you want. I will prepare example code with inference for all of the model's parts.

By the way, rten now initializes models much faster than tract, and if you can get the same inference speed when batch_size = 1 for the decoder and generator, that would be awesome! I can give more examples and more information about the decoder and generator if needed.
It is true that computation on padding is wasted, but depending on how much padding there is, the benefits of batching can still outweigh this. What I do in Ocrs for recognizing lines of text of varying width is to choose a threshold and form batches such that every image in the batch needs no more padding than that threshold allows. In any case, I agree that the […]
Yes, that would be helpful. It would help to understand at a high level what the processing pipeline / inference loop looks like, and what typical/realistic input shapes are for each of the models. I'm familiar with the standard transformer encoder-decoder where you do something like:

- run the encoder once over the input tokens
- run the decoder in a loop, feeding it the tokens generated so far together with the encoder output, and appending the predicted next token on each iteration until an end-of-sequence token or a length limit is reached
Is that what you are doing here or is it different? I notice that the decoder doesn't have KV-cache inputs, so presumably you're feeding it with a sequence that grows longer for each iteration of the loop?
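For reference, a minimal sketch of that loop, with hypothetical `encode` and `decode` closures standing in for the real model calls; it shows why, without a KV-cache, the decoder re-processes the whole generated prefix on every step:

```rust
const BOS: u32 = 1;
const EOS: u32 = 2;

/// Greedy decode: encode the source once, then repeatedly run the decoder
/// on the tokens generated so far and append the argmax token.
fn greedy_decode(
    source: &[u32],
    max_len: usize,
    encode: impl Fn(&[u32]) -> Vec<f32>,         // source tokens -> encoder memory
    decode: impl Fn(&[u32], &[f32]) -> Vec<f32>, // prefix + memory -> next-token logits
) -> Vec<u32> {
    let memory = encode(source); // run once per input
    let mut output = vec![BOS];

    for _ in 0..max_len {
        // The whole prefix is fed back in on each iteration (no KV-cache).
        let logits = decode(&output, &memory);
        let next = logits
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
            .map(|(i, _)| i as u32)
            .unwrap();
        output.push(next);
        if next == EOS {
            break;
        }
    }
    output
}

fn main() {
    // Dummy stand-ins so the sketch runs; real code would call the models.
    let encode = |src: &[u32]| vec![0.0; src.len()];
    let decode = |_prefix: &[u32], _memory: &[f32]| {
        let mut logits = vec![0.0_f32; 16];
        logits[EOS as usize] = 1.0; // always predict EOS
        logits
    };
    assert_eq!(greedy_decode(&[5, 6, 7], 32, encode, decode), vec![BOS, EOS]);
}
```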
Yes, I use a Transformer exactly as you described. You can find the `translate` and `greedy_decode` functions here: https://pytorch.org/tutorials/beginner/translation_transformer.html
Yeah, I think I should update the model architecture to be able to store the cache from the attention mechanism while decoding the sequence. I'll also try to look into the ocrs repository, thanks a lot! By the way, it would be awesome to be able to use quantized models; I attached weights in #42.
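As a sketch of what storing that cache amounts to (plain Rust, no RTen or PyTorch APIs, just the data structure): each attention layer keeps the key/value projections of past tokens, so a decoding step only has to project the newest token and append it instead of recomputing these values for the whole prefix.

```rust
/// Cached key/value projections for one attention layer, one entry per past token.
struct KvCache {
    keys: Vec<Vec<f32>>,
    values: Vec<Vec<f32>>,
}

impl KvCache {
    fn new() -> Self {
        Self { keys: Vec::new(), values: Vec::new() }
    }

    /// Append the projections computed for the newest token only.
    fn push(&mut self, key: Vec<f32>, value: Vec<f32>) {
        self.keys.push(key);
        self.values.push(value);
    }

    fn len(&self) -> usize {
        self.keys.len()
    }
}

fn main() {
    let mut cache = KvCache::new();
    for step in 0..4 {
        // In a real decoder these would be the K/V projections of the new token.
        cache.push(vec![step as f32; 8], vec![step as f32; 8]);
        println!("step {step}: attention now spans {} cached positions", cache.len());
    }
}
```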
I looked at the code in nn.Transformer, and it seems it doesn't use a key/value cache, because it doesn't run the encoder output through the Linear layer first but uses it directly in the attention mechanism. So I will try running with a batch of sentences.
Thanks for the input on this. The latest RTen releases include various optimizations for transformer decoders.

See the changes for the 0.6.0 and 0.7.0 releases for details.
In general I noticed that many models run faster with rten than with tract (I think because inference in tract runs in a single thread), but I noticed an interesting thing with transformers.

If the sequence length is short, for example 10 tokens, then inference with tract on my machine is 3 times faster. As the sequence length increases, rten starts to overtake tract in speed. For example, if I submit 1000 tokens, then rten is 2 times faster.

But since the decoder outputs each token separately during inference, the difference becomes noticeable regardless of the sequence size.
code example:
I can attach weights if needed
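A rough sketch of the kind of per-token timing being described, with a hypothetical `decode_one_step` placeholder standing in for one decoder run (not the actual benchmark code); it shows how a fixed per-call cost accumulates with the length of the generated output:

```rust
use std::time::{Duration, Instant};

// Hypothetical stand-in for running the decoder once on the current prefix
// and returning the next token; real code would invoke the model here.
fn decode_one_step(prefix_len: usize) -> u32 {
    (prefix_len % 7) as u32
}

fn main() {
    // The decoder is called once per generated token, so any fixed
    // per-call overhead is multiplied by the output length.
    for target_len in [1, 10, 100, 1000] {
        let mut total = Duration::ZERO;
        let mut prefix_len = 1; // start from a BOS token
        for _ in 0..target_len {
            let start = Instant::now();
            let _next = decode_one_step(prefix_len);
            total += start.elapsed();
            prefix_len += 1;
        }
        println!(
            "{target_len} generated tokens: total {total:?}, avg {:?} per token",
            total / target_len as u32
        );
    }
}
```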