Allow Tokens to Span Multiple Terminals in CFG #684

brandonwillard · 2024-02-20T04:46:09Z

Discussed in #683

Originally posted by lapp0 January 23, 2024

What behavior of the library made you think about the improvement?

Currently, each generated token must be part of a single terminal or a complete terminal. A token cannot start in one terminal and end in another.

E.g. in the gpt2 tokenizer, {" is a valid token. However, if { and " are separate terminals, as in a typical JSON grammar, { is allowed in the initial state (CFGFSM.allowed_token_ids(0)) but {" is not.
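The restriction described above can be illustrated with a toy model. This is only a sketch: the terminal table and the helper names (`TERMINALS`, `fits_single_terminal`, `allowed_without_spanning`) are hypothetical and are not part of the Outlines API, and terminals are modeled as literal strings rather than regexes.

```python
# Toy model (hypothetical, not the Outlines implementation):
# each terminal is a literal string instead of a regex.
TERMINALS = {"LBRACE": "{", "QUOTE": '"', "COLON": ":", "RBRACE": "}"}


def fits_single_terminal(token: str, terminal_text: str) -> bool:
    """True if the token is a complete terminal or a non-empty prefix of one."""
    return bool(token) and terminal_text.startswith(token)


def allowed_without_spanning(token: str, next_terminals: list[str]) -> bool:
    """Mimic the current rule: the token must fit inside one acceptable terminal."""
    return any(fits_single_terminal(token, TERMINALS[name]) for name in next_terminals)


# At the start of a JSON object, only LBRACE is acceptable.
print(allowed_without_spanning("{", ["LBRACE"]))   # the single-terminal token
print(allowed_without_spanning('{"', ["LBRACE"]))  # the token spanning { and "
```

Under this rule the second call is rejected even though `{"` is a valid beginning of a JSON object, which is the behavior this issue describes.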

This approach not only deviates from a correct representation of the grammar, but also hurts generation quality. For example, in the arithmetic grammar from README.md, using mistralai/Mistral-7B-v0.2, the most probable second token is " +" (space-prefixed). However, because the space is a separate terminal, this token isn't legal, so "+" without the leading space is selected instead. In scenarios like this, spaces, though grammatically valid and preferred by the model, are seldom produced, because the model would have to select the space as a standalone token to include any spaces.

How would you like it to behave?

Permit the generation of any token that complies with the grammar's production rules and is valid in the context of the preceding sequence of tokens, regardless of whether it spans multiple terminals.
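A minimal sketch of the requested check, under strong simplifying assumptions: terminals are literal strings, and the parser is replaced by a hypothetical successor table (`FOLLOW`) that lists which terminals may come next. Neither the table nor `token_allowed` comes from Outlines. A real implementation would query the parser state and use partial regex matching.

```python
# Hypothetical toy grammar: literal terminals plus a table of which
# terminals may follow which. A stand-in for a real parser, not Outlines code.
TERMINALS = {"LBRACE": "{", "QUOTE": '"', "COLON": ":", "RBRACE": "}"}
FOLLOW = {
    "START": ["LBRACE"],
    "LBRACE": ["QUOTE", "RBRACE"],
    "QUOTE": ["QUOTE", "COLON"],
    "COLON": [],
    "RBRACE": [],
}


def token_allowed(token: str, prev_terminal: str = "START") -> bool:
    """Return True if a non-empty token can be consumed, possibly across terminals."""
    for name in FOLLOW.get(prev_terminal, []):
        text = TERMINALS[name]
        if text.startswith(token):
            # The token ends inside this terminal (complete or partial match).
            return True
        if token.startswith(text):
            # The token covers this whole terminal; the rest must continue
            # into whatever terminals the grammar allows next.
            if token_allowed(token[len(text):], name):
                return True
    return False


print(token_allowed("{"))    # fits one terminal
print(token_allowed('{"'))   # spans LBRACE then QUOTE
print(token_allowed("{:"))   # COLON cannot follow LBRACE in this toy table
```

The recursion consumes one full terminal at a time and only accepts the token if the remainder is itself acceptable in the updated context, so a token is never admitted merely because its first characters match.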

This will require careful engineering and benchmarking to ensure the new trie-of-RegexFSM described at the end of section 4.2 of the outlines paper works properly.


brandonwillard · 2024-02-20T04:47:27Z

Converting this back into an issue because it does mostly describe a bug-like situation with the current implementation. Design and approach proposals should take place in the discussion, though.

lapp0 · 2024-05-31T02:53:37Z

As discussed in #796 (comment), resolving this issue will involve ensuring the parsing issues below are resolved

brandonwillard added enhancement grammar labels Feb 20, 2024

rlouf added this to Improve Outlines May 5, 2024

rlouf moved this to Todo in Improve Outlines May 5, 2024

lapp0 mentioned this issue Jun 12, 2024

Using context-free grammars to guide generation does not work #959

Closed

lapp0 mentioned this issue Jun 23, 2024

Stable cfg lapp0/outlines#34

Draft

10 tasks

This was referenced Jul 25, 2024

Cfg beta lapp0/outlines#85

Open

Update CFGGuide to use outlines.fsm.parsing. Enable generate.cfg #1067

Merged

brandonwillard closed this as completed in #1067 Aug 31, 2024

github-project-automation bot moved this from In Progress to Done in Improve Outlines Aug 31, 2024
