Added iOS refactored LlmInference, LlmTaskRunner and LlmSessionRunner #5566
Conversation
priankakariatyml commented Aug 12, 2024
- Added `LlmSessionRunner`, which manages a C session created by a C engine. Each Swift LLM session will be managed by its own `LlmSessionRunner`.
- Refactored `LlmTaskRunner` to manage the C LLM engine. All C methods that require the engine instance are invoked via the `LlmTaskRunner`.
- Added a refactored interface for `LlmInference` (see the sketch after this list for the overall ownership chain).
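A rough sketch of the ownership chain described above (illustrative stubs only; the real classes wrap opaque C engine/session handles rather than these placeholder types, and the real methods take configs and throw):

```swift
// Illustrative sketch of the ownership chain in this PR; the stub handle types
// stand in for the opaque C engine/session pointers.

final class CEngineHandle {}   // stand-in for the C engine handle
final class CSessionHandle {}  // stand-in for the C session handle

/// Owns the C engine; every C call that needs the engine instance goes through here.
final class LlmTaskRunner {
  private let cEngine = CEngineHandle()

  /// Creates a new C session from the engine and wraps it in a session runner.
  func createSessionRunner() -> LlmSessionRunner {
    LlmSessionRunner(cSession: CSessionHandle())
  }
}

/// Owns exactly one C session; each Swift session is managed by its own runner.
final class LlmSessionRunner {
  private let cSession: CSessionHandle
  init(cSession: CSessionHandle) { self.cSession = cSession }
}

/// Refactored public interface; engine-level work is delegated to the task runner.
final class LlmInference {
  private let llmTaskRunner = LlmTaskRunner()

  func createSessionRunner() -> LlmSessionRunner {
    llmTaskRunner.createSessionRunner()
  }
}
```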
mediapipe/tasks/ios/genai/core/sources/LlmTaskRunnerRefactored.swift
Is it okay to drop the `Refactored` suffix so we can see the diffs? Thanks!
Done. FYI, until we merge all the code, master will have an incomplete implementation.
mediapipe/tasks/ios/genai/core/sources/LlmTaskRunnerRefactored.swift
/// Creates a new C LLM session from the current C engine and returns an `LlmSessionRunner`
/// that wraps around the newly created C session. The session runner is responsible for managing
/// its underlying C session.
Shall we mention that this method will always return a new instance even with the same `LlmSessionConfig`?
We should. I have added a note in the comments.
}
guard let cResponse = responseContext?.pointee else {
  return
}
Is this expected? Shall we throw an error here?
Tbh, not expected. I have modified the code to throw an error if the `responseContext` guard fails. Also added some extra explanation to document the behaviour properly. The behaviour is similar to the previous version; I am just documenting it for clarity.
We can't throw an error if `context` is nil (first guard): this check is in the body of the C callback. The context has to be non-nil for us to be able to invoke any Swift callback, which is the only place we can relay errors at this point. So if it's nil, we can't do anything.
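A minimal sketch of the two guards being discussed, assuming a typical C-callback bridging pattern (the stub types and the callback shape are illustrative, not the exact PR code):

```swift
// Illustrative sketch: why the first guard can only return, while the second can
// relay an error through the Swift callback.

struct CResponseStub {  // stand-in for the C response context struct
  var done: Bool
}

enum GenAiInferenceErrorStub: Error {
  case invalidResponse
}

/// Boxes the Swift closure so it can travel through the C `void *` context pointer.
final class SwiftCallbackBox {
  let onResult: (Result<CResponseStub, Error>) -> Void
  init(onResult: @escaping (Result<CResponseStub, Error>) -> Void) { self.onResult = onResult }
}

typealias CCallback =
  @convention(c) (UnsafeMutableRawPointer?, UnsafeMutablePointer<CResponseStub>?) -> Void

let callback: CCallback = { context, responseContext in
  // First guard: without the context we cannot recover the Swift callback, so
  // there is no channel to relay an error through; returning is all we can do.
  guard let context = context else { return }
  let box = Unmanaged<SwiftCallbackBox>.fromOpaque(context).takeUnretainedValue()

  // Second guard: the Swift callback is reachable here, so a missing response
  // can be surfaced as an error instead of being silently dropped.
  guard let cResponse = responseContext?.pointee else {
    box.onResult(.failure(GenAiInferenceErrorStub.invalidResponse))
    return
  }
  box.onResult(.success(cResponse))
}
```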
Thanks Prianka, left some comments
/// array being in memory as long as the engine isn't deallocated since
/// `options.supportedLoraRanks` only has function scope.
///
/// TODO: If C++ API mem copies the array the following code can be updated to
The C++ layer copies the `supported_lora_ranks` array elements, and I think we should probably update the C API to take `const size_t* supported_lora_ranks`.
I have updated the code to use `withUnsafeMutableBufferPointer`. Also removed the manual allocation of `modelPath` and `cacheDir`; they can be passed using `withCString` as well.
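For context, the pattern under discussion looks roughly like this (the C function is a stand-in, not the real engine-creation API; the paths and ranks are made up):

```swift
// Sketch of scoped pointer passing instead of manual allocation. The pointers are
// only valid inside the closures, which is why the doc comment above warns that
// `options.supportedLoraRanks` only has function scope unless the C++ layer copies it.

func stubCreateEngine(
  modelPath: UnsafePointer<CChar>,
  cacheDirectory: UnsafePointer<CChar>,
  supportedLoraRanks: UnsafeMutablePointer<Int>?,
  supportedLoraRanksCount: Int
) {
  // A real implementation would forward these to the C engine-creation call.
}

var supportedLoraRanks = [4, 8, 16]  // made-up ranks
let modelPath = "/tmp/model.bin"     // made-up path
let cacheDirectory = "/tmp/cache"    // made-up path

modelPath.withCString { cModelPath in
  cacheDirectory.withCString { cCacheDirectory in
    supportedLoraRanks.withUnsafeMutableBufferPointer { ranks in
      stubCreateEngine(
        modelPath: cModelPath,
        cacheDirectory: cCacheDirectory,
        supportedLoraRanks: ranks.baseAddress,
        supportedLoraRanksCount: ranks.count)
    }
  }
}
```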
Sorry, it seems that you haven't committed the changes?
/// - Returns:
///   - An `LlmSessionRunner` that wraps around a new session.
/// - Throws: An error if the underlying engine could not create a session.
func createSessionRunner(sessionConfig: LlmSessionConfig) throws -> LlmSessionRunner {
Just to confirm, `LlmSessionConfig` is the C struct, right? Can we also create a Swift class for this, similar to the `LlmInference` options?
We do have `Session.Options` in Swift. I have not committed the file for the `Session` to keep the current PR size manageable. The Swift `Options` are also declared in that file.
This method is not public. It has internal visibility and is called by the `init` of the Swift `Session`. `Session.init(llmInference:options:)` creates the C struct `LlmSessionConfig` from the Swift options provided by the user and then calls `llmInference.createSessionRunner(sessionConfig)`.
I'll commit the rest of the code once this PR is merged.
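A sketch of that flow, using stub types for illustration (the option fields and config members shown here are assumptions, not the exact API):

```swift
// Illustrative flow: Session converts its Swift options into the C session config
// and asks LlmInference for a session runner. Field names are placeholders.

struct LlmSessionConfigStub {  // stand-in for the C LlmSessionConfig struct
  var topk: Int
  var temperature: Float
}

final class LlmSessionRunnerStub {}

final class LlmInferenceStub {
  /// Internal (not public): only `Session.init` is expected to call this.
  func createSessionRunner(sessionConfig: LlmSessionConfigStub) throws -> LlmSessionRunnerStub {
    // Would create a new C session from the C engine and wrap it.
    LlmSessionRunnerStub()
  }
}

final class Session {
  struct Options {
    var topk = 40                // placeholder option
    var temperature: Float = 0.8 // placeholder option
  }

  private let llmSessionRunner: LlmSessionRunnerStub
  private let llmInference: LlmInferenceStub

  init(llmInference: LlmInferenceStub, options: Options) throws {
    // The Session decides how its options map onto the C config...
    let sessionConfig = LlmSessionConfigStub(topk: options.topk, temperature: options.temperature)
    // ...and LlmInference only needs the finished config to create the runner.
    self.llmSessionRunner = try llmInference.createSessionRunner(sessionConfig: sessionConfig)
    self.llmInference = llmInference
  }
}
```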
We can also change this method to `createSessionRunner(options: Session.Options)` and let this method handle creation of the C session config from the options passed in by the `Session`. I just thought it would be better for the `Session` to convert its options to the config however it sees fit and not let `LlmInference` worry about it. LMK what you think.
> This method is not public. It has internal visibility and is called by the `init` of the Swift `Session`. `Session.init(llmInference:options:)` creates the C struct `LlmSessionConfig` from the Swift options provided by the user and then calls `llmInference.createSessionRunner(sessionConfig)`.

Thanks for the explanation! If you don't mind, can we make all changes in one PR? We have some internal dependency on this Swift API, so I hope we can make all the necessary changes in one PR. Let me know if you have other opinions.
I have pushed all the code to this PR. Do have a look.
llmTaskRunner = try options.modelPath.withCString { modelPath in
  try cacheDirectory.withCString { cacheDirectory in
    try options.supportedLoraRanks.withUnsafeMutableBufferPointer { supportedLoraRanks in
w.r.t. #5566 (comment): @yishuangP These changes were already pushed. I think there was some git refresh issue.
private let llmSessionRunner: LlmSessionRunner

// LLM Inference used to create this session.
private let llmInference: LlmInference
@yishuangP I am storing a strong reference to `llmInference` to ensure that any given session does not outlive the `llmInference` and create all sorts of undefined behaviour. This won't cause a retain cycle since `llmInference` does not store a reference to any session. This means the `llmInference` would get deallocated only after all the sessions created from it are destroyed.
It also lets me synchronise the response generation calls across sessions to fix a crash caused by simultaneous response generation calls.
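A tiny sketch of the lifetime argument, with stub classes (the print statements are only there to show deallocation order):

```swift
// LlmInference keeps no back-reference to its sessions, so the strong reference
// held by Session cannot form a retain cycle; it only guarantees that the engine
// outlives every session created from it.

final class LlmInferenceStub {
  // Note: no collection of sessions is stored here.
  deinit { print("engine deallocated") }
}

final class SessionStub {
  // Strong reference: the C engine owned by llmInference must outlive the
  // C session owned by this session's runner.
  private let llmInference: LlmInferenceStub
  init(llmInference: LlmInferenceStub) { self.llmInference = llmInference }
  deinit { print("session deallocated") }
}

var inference: LlmInferenceStub? = LlmInferenceStub()
var session: SessionStub? = SessionStub(llmInference: inference!)

inference = nil  // engine is not deallocated yet; the session still holds it
session = nil    // prints "session deallocated", then "engine deallocated"
```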
Thanks! Yeah, agreed that we need a strong reference to `llmInference` here to manage lifecycles.
> It also lets me synchronise the response generation calls across sessions to fix a crash caused by simultaneous response generation calls.

I believe only one response generation can happen at a time; is this what you are referring to?
Yes, across sessions.
/// to execute response generation. If response generation is already in progress, throws an
/// error.
/// Any
func shouldContinueWithResponseGeneration() throws {
@yishuangP I am maintaining a response generation state in the `LlmInference`, which the session can read and update through these methods to prevent simultaneous response generation calls from multiple sessions.
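A minimal sketch of what such shared state could look like; the lock-based implementation and the error case are assumptions, only the two method names come from this PR:

```swift
import Foundation

// Assumed shape of the shared response-generation state, guarded by a lock.
final class LlmInferenceStub {
  enum StateError: Error {
    case responseGenerationInProgress  // illustrative error case
  }

  private let responseGenerationLock = NSLock()
  private var responseGenerationInProgress = false

  /// Called by a session before it starts generating; throws if any session
  /// created from this engine is already generating a response.
  func shouldContinueWithResponseGeneration() throws {
    responseGenerationLock.lock()
    defer { responseGenerationLock.unlock() }
    guard !responseGenerationInProgress else { throw StateError.responseGenerationInProgress }
    responseGenerationInProgress = true
  }

  /// Called by a session once its response generation finishes (or fails).
  func markResponseGenerationCompleted() {
    responseGenerationLock.lock()
    defer { responseGenerationLock.unlock() }
    responseGenerationInProgress = false
  }
}
```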
Thanks a lot! Can you help fix the sample app `examples/llm_inference/ios/InferenceExample` in a follow-up PR?
let humanReadableLlmResponse = Session.humanReadableString(
  llmResponses: responseStrings, stripLeadingWhitespaces: receivedFirstToken)
else {
  progress(nil, GenAiInferenceError.invalidResponse)
We should also call `self?.llmInference.markResponseGenerationCompleted()` if it errors out?
You mean if it errors out in the progress callback? I am assuming that, even if one of the partial responses is invalid, the C++ session would still continue giving callbacks for the remaining responses.
In that case, wouldn't marking it completed still cause the same issue? By design, irrespective of whether the response returned by C++ is invalid, the session runner will also call `completion()` after `progress()` if the response context marks the bool `done` as true.
LMK if this is not how C++ behaves.
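In other words, the relaying behaviour being described is roughly the following (a sketch under the assumption above about how C++ delivers partial responses, not the actual `predictAsync` code):

```swift
// Sketch: progress() fires for every partial response (with an error if it is
// invalid), and completion() fires afterwards whenever the C response context
// marks `done` as true, regardless of the validity of that last partial response.

struct CResponseStub {  // stand-in for the C response context
  var responseStrings: [String]?
  var done: Bool
}

enum GenAiInferenceErrorStub: Error { case invalidResponse }

func relay(
  cResponse: CResponseStub,
  progress: (String?, Error?) -> Void,
  completion: () -> Void
) {
  if let strings = cResponse.responseStrings {
    progress(strings.joined(), nil)
  } else {
    progress(nil, GenAiInferenceErrorStub.invalidResponse)
  }
  if cResponse.done {
    completion()
  }
}
```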
But thanks to your query, I spotted an issue in the control flow of the sync `predict()` when an error happens, and I have fixed it.
Thanks for the explanation. I took a look at the `llmSessionRunner.predictAsync` implementation; we don't need to call `self?.llmInference.markResponseGenerationCompleted()` here.
Sure. I will keep this ready.
Thanks!
sequence_batch_size: LlmInference.sequenceBatchSize,
number_of_supported_lora_ranks: options.supportedLoraRanks.count,
supported_lora_ranks: supportedLoraRanks.baseAddress,
max_top_k: options.maxTopk)
@yishuangP @schmidt-sebastian This PR is out of sync with the C `llm_inference_engine` on master. You won't be able to merge this one because of a change in `LlmModelSettings`.
Can you pass `llm_activation_data_type: LlmActivationDataType(0), num_draft_tokens: 0` after `max_top_k` to this API while merging? That should fix it. I'll add these new options to the Swift APIs once this PR is merged.
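For reference, an abridged sketch of how the argument list above would look with that fix applied (unverified against master; only the fields visible in the diff plus the two suggested arguments are shown, and the earlier arguments of the settings struct are omitted):

```swift
// Abridged: the tail of the LlmModelSettings call after the merge fix; earlier
// arguments are unchanged and omitted here.
  sequence_batch_size: LlmInference.sequenceBatchSize,
  number_of_supported_lora_ranks: options.supportedLoraRanks.count,
  supported_lora_ranks: supportedLoraRanks.baseAddress,
  max_top_k: options.maxTopk,
  llm_activation_data_type: LlmActivationDataType(0),
  num_draft_tokens: 0)
```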