[WIP] refactor: init some experimental refactoring. #362

Closed
wants to merge 1 commit

Conversation

AsakusaRinne
Collaborator

This PR refactors the implementation of LLamaSharp and is targeted at v1.0.0.

I've pushed some changes but haven't completed them yet. For now this PR is more of a proposal; there's no guarantee it compiles or passes the tests.

So far I've made the following changes:

  1. Add IGenerationControl to allow customized control over when generation stops, instead of relying only on anti-prompts (see the interface sketch after this list).
  2. Add ITokenizer to allow a customized tokenizer.
  3. Allow calling the stateless executor with a self-managed context.
  4. Add TextCompletion, the counterpart of ChatSession, to provide further wrapping for text completion tasks. Related issues: Create HTTP API server and provide API like OAI #269, Garbled output from model in Unity #178.
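For reference, here is a minimal sketch of the IGenerationControl shape implied by the ShouldStopGeneration method quoted in the review below; anything beyond that signature is an assumption about this WIP branch, not a final design.

using System.Collections.Generic;
using LLama.Abstractions; // assumed home of IInferenceParams

namespace LLama.Control
{
    /// <summary>
    /// Decides when generation should stop, instead of a hard-coded anti-prompt check.
    /// </summary>
    public interface IGenerationControl
    {
        /// <summary>
        /// Returns true if generation should stop after the latest output tokens.
        /// </summary>
        bool ShouldStopGeneration(LLamaContext context, IInferenceParams inferenceParams, IEnumerable<int> lastOutputIds);
    }
}

An executor could then call ShouldStopGeneration once per generated token and break out of its loop when it returns true, with the antiprompt-based class discussed below acting as the default implementation.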

Some other proposals

  1. Make InstructExecutor obsolete, because instruct mode is only one mode of text completion.
  2. Allow a self-managed KV cache, providing APIs to allocate, update and release the cache (maybe do this on master?).
  3. Apply batch decoding, which will require refactoring the StatelessExecutor again.

For discussion

I'm wondering if we should introduce Microsoft.SemanticKernel.Abstractions into LLamaSharp itself, instead of only using it in the SK integration. My reasons are as follows:

  1. Some features, such as function calling, are complex and are already fully or partially supported in SK. We'd have a hard time supporting them from scratch, given our limited hands and the rapid pace of LLM iteration.
  2. SK defines some useful abstractions. We could reuse them even where our implementations differ, for example prompt templates. That would be convenient for both users and SK-integration developers.
  3. SK is on its way to becoming the standard for LLMs in dotnet.

However, I also have some worries about it:

  1. We'll have to follow SK's changes. If SK makes breaking changes, we may have to rush to keep up.
  2. Our direction of development will be restricted. I'm not sure how SK will evolve in the future. Though SK is open source, it contains a lot of Azure-oriented content because it's developed by Microsoft, whereas LLamaSharp has no such preference.

@martindevans @SignalRT @saddam213 May I ask for your thoughts? This is not a formal PR, so please feel free to discuss anything.

@AsakusaRinne added the help wanted and break change labels on Dec 13, 2023
@martindevans
Member

martindevans commented Dec 13, 2023

Allow a self-managed KV cache, providing APIs to allocate, update and release the cache (maybe do this on master?).
Apply batch decoding, which will require refactoring the StatelessExecutor again.

These two will probably be handled by my ongoing work to redesign the executors. My plan is a new executor, built on a single batch, which can host multiple ongoing "conversations" that are all updated in that one batch.
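Purely to illustrate that direction (every type and method name below is hypothetical; none of this exists yet):

// Hypothetical sketch: one executor owns a single batch, and each
// "conversation" is a sequence scheduled into that batch.
var executor = new BatchedExecutor(weights, parameters);  // hypothetical type
var left = executor.CreateConversation();                 // sequence 0
var right = executor.CreateConversation();                // sequence 1
left.Prompt("Hello");
right.Prompt("Bonjour");
await executor.Infer();  // a single decode call advances both conversations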

I'm wondering if we should introduce Microsoft.SemanticKernel.Abstractions to LLamaSharp, instead of just using it in SK integration

Personally I would be against this; I think we should keep LLamaSharp as close to being just a wrapper around llama.cpp as possible. However, we should definitely make sure the library is flexible enough that something like SK can be integrated with it!

/// <summary>
/// Set a custom generation control to use. <b>If this is set antiprompt will be ignored!</b>
/// </summary>
IGenerationControl GenerationControl { get; set; }
Member


Suggested change
- IGenerationControl GenerationControl { get; set; }
+ IGenerationControl? GenerationControl { get; set; }

/// <summary>
/// Set a custom tokenizer to use.
/// </summary>
ITokenizer Tokenizer { get; set; }
Member


Suggested change
- ITokenizer Tokenizer { get; set; }
+ ITokenizer? Tokenizer { get; set; }

namespace LLama.Control
{
/// <summary>
/// The default generation control in LLamaSharp, using antiprompts. This class should not be inherited.
Member


Suggested change
- /// The default generation control in LLamaSharp, using antiprompts. This class should not be inherited.
+ /// The default generation control in LLamaSharp, using antiprompts.

Member


It's sealed, so it's not possible to extend this class.

{
/// <summary>
/// The default generation control in LLamaSharp, using antiprompts. This class should not be inherited.
/// <b>Note that this class has state. The previous outputs feeded to it will affect its control.</b>
Member


Suggested change
- /// <b>Note that this class has state. The previous outputs feeded to it will affect its control.</b>
+ /// <b>Note that this class has state. The previous outputs fed to it will affect its output.</b>

/// </summary>
public bool ShouldStopGeneration(LLamaContext context, IInferenceParams inferenceParams, IEnumerable<int> lastOutputIds)
{
return false;
Member


Should this be returning false?
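For contrast, the antiprompt-driven behaviour being asked about would presumably look something like the fragment below. This is only an illustration: DeTokenize and the AntiPrompts member are assumptions about the surrounding code, and the buffer is the state that the summary above warns about.

// Fragment of the class body, for illustration only:
private readonly StringBuilder _decoded = new StringBuilder();

public bool ShouldStopGeneration(LLamaContext context, IInferenceParams inferenceParams, IEnumerable<int> lastOutputIds)
{
    // Append the text of the newest tokens to a running buffer
    // (DeTokenize stands in for whichever decoding helper is used).
    _decoded.Append(context.DeTokenize(lastOutputIds.ToList()));

    // Stop as soon as the accumulated output ends with any antiprompt.
    foreach (var antiprompt in inferenceParams.AntiPrompts)
    {
        if (_decoded.ToString().EndsWith(antiprompt, StringComparison.Ordinal))
            return true;
    }
    return false;
}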

namespace LLama.Transform
{
/// <summary>
/// The default tokenizer of LLamaSharp. This class should not be inherited.
Member


Suggested change
- /// The default tokenizer of LLamaSharp. This class should not be inherited.
+ /// The default tokenizer of LLamaSharp.

{
/// <summary>
/// The default tokenizer of LLamaSharp. This class should not be inherited.
/// <b>Note that this class has state. The previous outputs feeded to it will affect its control.</b>
Member


Suggested change
- /// <b>Note that this class has state. The previous outputs feeded to it will affect its control.</b>
+ /// <b>Note that this class has state. The previous outputs fed to it will affect its output.</b>

/// <summary>
/// <inheritdoc/>
/// </summary>
public IEnumerable<int> Tokenize(LLamaContext context, string text, bool addBos = true, bool special = false)
Member


I think it would be better to accept a LLamaWeights in the constructor, instead of a LLamaContext in the Tokenize/Detokenize methods. That simplifies usage and allows you to use the tokenizer without creating an entire context (which is quite memory hungry).

Collaborator Author


I'll change it, thank you!

Collaborator Author


Is there already a way to tokenize without a context? I didn't find such a method.

Member

martindevans commented Dec 15, 2023


LLamaWeights w = your_weights;
w.NativeHandle.Tokenize("a string");

Should do it. There should really be higher level wrappers for tokenization in LLamaWeights, so that you don't have to access the NativeHandle, but I haven't built them yet.

Collaborator Author


I see, I mistook the one in NativeApi that takes a context as a parameter for the lowest-level API. I'll add a wrapper for it with LLamaWeights. :)
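For reference, a rough sketch of what such a weights-backed wrapper could look like; the class name is hypothetical, and the exact NativeHandle.Tokenize signature and return type are assumptions.

using System.Collections.Generic;
using System.Text;

namespace LLama.Transform
{
    /// <summary>
    /// Tokenizer bound to a LLamaWeights instance, so no LLamaContext is required.
    /// (Detokenize omitted for brevity.)
    /// </summary>
    public sealed class WeightsTokenizer
    {
        private readonly LLamaWeights _weights;

        public WeightsTokenizer(LLamaWeights weights)
        {
            _weights = weights;
        }

        public IEnumerable<int> Tokenize(string text, bool addBos = true, bool special = false)
        {
            // Assumed shape of the native-handle call; see the snippet above.
            return _weights.NativeHandle.Tokenize(text, addBos, special, Encoding.UTF8);
        }
    }
}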

{
/// <summary>
/// Decodes a stream of tokens into a stream of characters
/// </summary>
public sealed class StreamingTokenDecoder
{
- private readonly SafeLlamaModelHandle _weights;
+ private readonly SafeLlamaModelHandle? _weights;
Member


I don't think any of the changes in this class are necessary if we go with the suggested change to the DefaultTokenizer.

@martindevans
Member

Overall I really like changes 1, 2 and 3. Can we split them into 3 separate PRs so they can proceed separately?

@AsakusaRinne
Collaborator Author

Overall I really like changes 1, 2 and 3. Can we split them into 3 separate PRs so they can proceed separately?

Sure, I'll separate them into 3 PRs.

Besides, there are some proposals I forgot to mention above:

  1. Support CUDA backends with different AVX levels.
  2. Merge the cuda11 and cuda12 backends into one, relying on feature detection. In addition, allow users to specify the CUDA version in NativeLibraryConfig to deal with some complex setups (see the configuration sketch after this list).
  3. Make withLog in NativeLibraryConfig true by default because, judging from the issues, its output is very important for users when debugging.
  4. Refactor the web service implementation after batch decoding is supported.
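A tiny illustration of how proposals 2 and 3 could surface to users; WithCuda and WithLogs are assumed to follow the existing fluent NativeLibraryConfig style, and the CUDA-version override is purely hypothetical.

// Configure the native backend before any model is loaded.
NativeLibraryConfig.Instance
    .WithCuda()       // merged CUDA backend, relying on feature detection
    .WithLogs(true);  // proposal 3: native-library logging on by default
// Hypothetical future call for proposal 2: NativeLibraryConfig.Instance.WithCudaVersion(12);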

There are also some features that give me a headache. I don't intend to complete them in v1.0.0, but I'd like to include them in the discussion.

  1. Support fine-tuning.
  2. LLaVA support, or more generally, image-to-text support.
  3. Catch errors from llama.cpp. This is useful for automatically configuring how many layers are offloaded to VRAM.
