Add Transformers model and Completion sequence generation #139
Left to implement:
- Models
- Completion
It might be easier to allow the user to handle the caching. That way the model doesn't have to save all the text it has ever been prompted with. E.g.
I think this is a bit similar to how Hugging Face handles kv-caches right now. It would also allow using a kv_cache that has been "learned" by fine-tuning, rather than representing an actual prefix. At some point we'd like to give the model access to a vector db of "external" key-value pairs. I wonder if you are interested in having such a feature in outlines as well.
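For illustration, a rough sketch of the user-managed caching suggested above, using the Hugging Face kv-cache API that the comment refers to (a sketch only, not this PR's interface):

```python
# Sketch of caller-managed kv-caching with the Hugging Face API (illustrative,
# not part of this PR): the caller keeps `past_key_values` and passes it back
# in, so the model wrapper never has to store the full prompt history itself.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

prefix = tokenizer("This is a prompt", return_tensors="pt")
with torch.no_grad():
    out = model(**prefix, use_cache=True)
kv_cache = out.past_key_values  # held by the caller, not by the model

# A later call only feeds the new token(s) and reuses the cached prefix.
next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
with torch.no_grad():
    out = model(input_ids=next_token, past_key_values=kv_cache, use_cache=True)
```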
It's not documented yet; the API might still change.
Could be, or I'm doing something wrong with the arrays. Given the size of token id arrays I didn't think it would make a huge difference, but I will need to benchmark this. One question before I start benchmarking: did you set
The default stopping criterion is
I timed Outlines and Hugging Face's `generate`:

```python
def test_time_outlines():
    import time

    import outlines.models as models
    from outlines.text.sequences.continuation import continuation

    now = time.time()
    model = models.transformers("gpt2")
    sequence = continuation(model, max_tokens=100)("This is a prompt")
    print(f"Outlines: {time.time()-now:.2f}")


def test_time_hf():
    import time

    from transformers import AutoModelForCausalLM, AutoTokenizer

    now = time.time()
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    inputs = tokenizer(["This is a prompt"], return_tensors="pt")
    out = model.generate(**inputs, do_sample=True, max_new_tokens=100)
    sequence = tokenizer.batch_decode(out)
    print(f"HuggingFace: {time.time()-now:.2f}")
```

On CPU it is a wash; however, setting `samples=10` (resp. `num_return_sequences=10` for HF):

```python
def test_time_outlines():
    import time

    import outlines.models as models
    from outlines.text.sequences.continuation import continuation

    now = time.time()
    model = models.transformers("gpt2")
    sequence = continuation(model, max_tokens=100)("This is a prompt", samples=10)
    print(f"Outlines: {time.time()-now:.2f}")


def test_time_hf():
    import time

    from transformers import AutoModelForCausalLM, AutoTokenizer

    now = time.time()
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    inputs = tokenizer(["This is a prompt"], return_tensors="pt")
    out = model.generate(**inputs, do_sample=True, max_new_tokens=100, num_return_sequences=10, use_cache=False)
    sequence = tokenizer.batch_decode(out)
    print(f"HuggingFace: {time.time()-now:.2f}")
```

Passing the KV cache around is thus very important. I think managing the cache won't change the design introduced here dramatically, but it would add to an already big PR. #150 already tracks this issue. I therefore suggest we address it separately, if anything so it can be solved in parallel with #151 - #156, which are also blocked by this PR.

@arunpatro Would you mind comparing the runtimes on GPU with

Update: Using
Yeah surely.
What is the intuition here? How can outlines be faster than native, considering we do a lot of work on top? Are we avoiding extra boilerplate code that HF has inside their
We do fewer things than native in
The reason I decouple the model call / sampling this way is to be able to support different model providers and sampling methods (like SMC) in the future.
I see, that makes sense.
I like this design decision of de-coupling the model and the generation process.

Question 1: Why does the outlines code not load the model on the GPU automatically? According to the code I see, if

Question 2: Why are we moving back and forth between torch and numpy arrays? I modified this branch to only use torch tensors and it can be a drop-in replacement (except MPS devices, because

I also ran your test cases and I can confirm that outlines is faster on cuda for

```python
# model_name = "gpt2"
model_name = "togethercomputer/RedPajama-INCITE-Instruct-3B-v1"
prompt = "This is a prompt"


def test_time_outlines():
    import time

    import outlines.models as models
    from outlines.text.sequences.continuation import continuation

    now = time.time()
    model = models.transformers(model_name)
    sequence = continuation(model, max_tokens=100)(prompt, samples=10)
    print(f"Outlines: {time.time()-now:.2f}")


def test_time_hf():
    import time

    from transformers import AutoModelForCausalLM, AutoTokenizer

    now = time.time()
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    inputs = tokenizer([prompt], return_tensors="pt")
    out = model.generate(**inputs, do_sample=True, max_new_tokens=100, num_return_sequences=10, use_cache=False)
    sequence = tokenizer.batch_decode(out)
    print(f"HuggingFace: {time.time()-now:.2f}")


def test_time_outlines_cuda():
    import time

    import outlines.models as models
    from outlines.text.sequences.continuation import continuation

    now = time.time()
    model = models.transformers(model_name, 'cuda')
    sequence = continuation(model, max_tokens=100)(prompt, samples=10)
    print(f"Outlines CUDA: {time.time()-now:.2f}")


def test_time_hf_cuda():
    import time

    from transformers import AutoModelForCausalLM, AutoTokenizer

    now = time.time()
    model = AutoModelForCausalLM.from_pretrained(model_name).to('cuda')
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    inputs = tokenizer([prompt], return_tensors="pt")['input_ids'].to('cuda')
    out = model.generate(inputs, do_sample=True, max_new_tokens=100, num_return_sequences=10, use_cache=False)
    sequence = tokenizer.batch_decode(out)
    print(f"HuggingFace CUDA: {time.time()-now:.2f}")
```

The times show that HF is faster than Outlines on both cuda and cpu (for

HuggingFace CUDA: 31.83

Tested on
Thanks! Are those average times over several runs? I'm asking because with any other model than GPT2 the sequence is likely to terminate before `max_tokens` is reached.

A possibility for the CUDA case is that with a bigger vocabulary we're paying for the memory transfer, which can be avoided by delegating next-token sampling to the model class as well. This way we're only transferring a few tokens instead of the full logits. The reason for keeping NumPy is to keep as much generality as we can if we want to include llamacpp, JAX or TF.

I don't have an explanation for the CPU case...

PS: do you need an A100 for a 3B model?
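For illustration, a small sketch of the logits-vs-token-ids transfer point above (shapes and device handling are illustrative; this is not Outlines code):

```python
# Sampling on the device and transferring only the sampled token ids is much
# cheaper than moving the full (batch, vocab_size) logits to the CPU first.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
batch, vocab_size = 10, 50_257
logits = torch.randn(batch, vocab_size, device=device)

# Option 1: move the full logits to the CPU, then sample there.
probs_cpu = torch.softmax(logits.cpu(), dim=-1)
tokens_cpu = torch.multinomial(probs_cpu, num_samples=1)

# Option 2: sample on the device, move only (batch, 1) token ids.
probs = torch.softmax(logits, dim=-1)
tokens = torch.multinomial(probs, num_samples=1).cpu()
```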
No, these are not average numbers, but I ran this multiple times and it's similar.
Not really, but I am experimenting with up to 13B models, and sometimes I need a lot of GPU RAM because of batch sizes (vLLM requirements, etc.).
Because HF models are loaded on CPU by default, and you need to explicitly move them to another device. I'm following their conventions as much as I can. Do you think we should do it differently?
I have thought more about this, and there's probably no good reason not to be using PyTorch as the default. This would allow us to keep a strict separation between model calls and logit manipulation without moving memory around. Arrays output by llamacpp can be converted into PyTorch tensors. Since switching to PyTorch requires a little more exploration and doesn't change the overall structure of the code introduced here, I suggest we merge this PR and track this in #164. Wdyt @brandonwillard?
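For illustration, the kind of zero-copy conversion this refers to (a sketch, not code from this PR; the array below just stands in for a llama.cpp output):

```python
# `torch.from_numpy` wraps a NumPy array as a tensor without copying memory,
# so arrays returned by a llama.cpp binding could be fed to torch-based
# logit manipulation directly.
import numpy as np
import torch

logits_np = np.random.rand(1, 32_000).astype(np.float32)  # stand-in for llama.cpp logits
logits_pt = torch.from_numpy(logits_np)  # shares memory with logits_np
```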
Yeah, let's do that in a quick follow-up.
No, this is good. We should stick to HF wherever we can for defaults. The correct and fastest way to load a model onto a device is to use
Minor questions and comments; otherwise, I'm good with moving forward on this.
Some notes on sequence generation
If we denote by $\boldsymbol{t} = (t_1, \dots, t_n)$ a sequence of $n$ tokens that represents a string, then we can define the next token $t_{n+1}$ in the sequence as the following random variable:

$$\boldsymbol{\alpha} = \mathrm{LM}(\boldsymbol{t}, \boldsymbol{\theta}), \qquad t_{n+1} \sim \mathrm{Categorical}(\boldsymbol{\alpha}),$$

where $\boldsymbol{\theta}$ is the set of trained parameters. This random variable's support is the entire token vocabulary $\mathcal{V}$ (~ $10^3$ to $10^4$ tokens). In the following, the $\mathrm{LM}$ function will typically refer to a deep neural network trained on next-token-completion tasks, but it does not need to be, and can be as simple as a function that returns a constant vector with equal values.

This random variable is the basic building block that allows us to sample text sequences. Representing these random variables explicitly is thus the first step in an effort to refactor the generation process in Outlines to make it more flexible, and to allow a larger variety of sequence-generating processes and ways to sample them.
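A minimal sketch of this definition, with a Hugging Face GPT-2 standing in for $\mathrm{LM}$ (illustrative only):

```python
# alpha = LM(t, theta): the logits at the last position parameterize a
# categorical distribution over the next token t_{n+1}.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

token_ids = tokenizer("This is a prompt", return_tensors="pt").input_ids
with torch.no_grad():
    alpha = model(token_ids).logits[:, -1, :]
next_token = torch.distributions.Categorical(logits=alpha).sample()
```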
Using the same model call we can define many other random variables by manipulating the output logits $\boldsymbol{\alpha}$, randomly or deterministically. A particularly interesting example is when one applies a boolean mask $m$ to restrict the support of the distribution, like so:

$$\tilde{\alpha}_i = \begin{cases} \alpha_i & \text{if } m_i = 1 \\ -\infty & \text{if } m_i = 0 \end{cases}, \qquad t_{n+1} \sim \mathrm{Categorical}(\tilde{\boldsymbol{\alpha}}).$$
This mask can encode some a priori knowledge about what the support should be. For instance, it can:

- remove the `<EOS>` (end-of-sequence) token from the support;
- restrict the support to tokens that match the `[a-zA-Z]` regular expression.

We can summarize the above in the following notation:

$$t_{n+1} \sim \mathrm{Token}_m(\boldsymbol{t}),$$
where $\mathrm{Token}_m$ is a random variable with support $\mathcal{V} \backslash \left\{ i | m_i = 0 \right\}$ and parametrized by $\boldsymbol{t}$.
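A small illustrative sketch of the masked variable, assuming `logits` are the raw model outputs over the vocabulary:

```python
# Positions where the mask is 0 are removed from the support before sampling,
# e.g. to forbid the <EOS> token or tokens that don't match a regex.
import torch

vocab_size = 8
logits = torch.randn(vocab_size)
mask = torch.ones(vocab_size, dtype=torch.bool)
mask[0] = False  # e.g. remove the <EOS> token from the support

masked_logits = logits.masked_fill(~mask, float("-inf"))
token = torch.distributions.Categorical(logits=masked_logits).sample()
```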
In practice
Generating a text sequence using language models requires:
Models
In this PR we introduce the `Transformers` object, which is responsible for initializing and calling the model. This implicitly defines the interface that models have with the rest of the library (a parent class will be added). In particular:

The k-v cache will require some careful thought, and will likely have to be customized to the particular class of models for which it is implemented (`transformers`, `llama.cpp`, etc.). My initial thought is to build a trie that we query each time the model is called; this cache is attached to the model instance (a rough sketch is given at the end of this section).

Another difficulty arises when the workflow uses different local models, since we cannot hold all of the models' weights in memory. We may need a process that supervises the models so that when a model is called we know whether it is currently loaded in memory.
We don't necessarily need to solve both these problems now, and they can be turned into issues.
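For reference, a rough sketch of the trie idea mentioned above (purely illustrative; `KVCacheTrie` and its methods are hypothetical and not part of this PR):

```python
# A prefix trie keyed by token ids; each node can hold the kv-cache computed
# for the prefix that ends at that node.
from typing import Any, Dict, Optional, Sequence


class KVCacheTrie:
    def __init__(self) -> None:
        self.children: Dict[int, "KVCacheTrie"] = {}
        self.value: Optional[Any] = None  # kv-cache for the prefix ending here

    def insert(self, token_ids: Sequence[int], value: Any) -> None:
        node = self
        for token_id in token_ids:
            node = node.children.setdefault(token_id, KVCacheTrie())
        node.value = value

    def longest_prefix(self, token_ids: Sequence[int]) -> Optional[Any]:
        """Return the cache stored for the longest cached prefix of `token_ids`."""
        node, best = self, self.value
        for token_id in token_ids:
            if token_id not in node.children:
                break
            node = node.children[token_id]
            if node.value is not None:
                best = node.value
        return best
```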
Sequences
Sequences are objects that represent a sequence-generation model; when called with a prompt or a list of prompts they generate the sequence(s). Here we implement the simplest possible sequence: completion until an EOS token is found. The proposed API is as follows:
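A minimal usage sketch, reconstructed from the calls that appear in the benchmark snippets earlier in the thread (not the verbatim snippet from this description):

```python
import outlines.models as models
from outlines.text.sequences.continuation import continuation

model = models.transformers("gpt2")
complete = continuation(model, max_tokens=100)

sequence = complete("This is a prompt")               # a single completion
sequences = complete("This is a prompt", samples=10)  # several sampled completions
```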
When sampling generations we can simply return a (list of) string(s), but for more advanced generation mechanisms we will need to return a state object that contains both the completion and the corresponding probabilities. This also means that we will need to unpack/repack the state when calling Python functions in the middle of a chain.
Local model and API calls
Models based on API calls are less flexible; for instance we cannot shape the proposals completely (when we can shape them at all), and we don't have access to the logits. They have a different interface and will thus need to inherit from a different base class.
This also means that the `Sequence` implementation needs to be different; to make this transparent to the user we implement a `completion` function which dispatches to `Completion` when the logits are available, and to a custom implementation when they are not. It can possibly fail when some generation constraints cannot be applied when calling the API.