Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update to Large Model and docs requirements #2468

Merged
merged 10 commits into from
Jul 26, 2023

Conversation

sekyondaMeta
Copy link
Contributor

Description

Updates to large model inference doc per Issue #2438

Adding requirements to requirements.txt to build incomplete docs per issue #2048

Fixes #(issue)

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

Docs built locally and they work:

Screenshot 2023-07-17 at 12 51 35 PM

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

Updates to large model inference doc per Issue pytorch#2438

Adding requirements to requirements.txt to build incomplete docs per issue pytorch#2048
@codecov
Copy link

codecov bot commented Jul 17, 2023

Codecov Report

Merging #2468 (75bbb2c) into master (31b42e8) will not change coverage.
The diff coverage is n/a.

❗ Current head 75bbb2c differs from pull request most recent head e8dabf0. Consider uploading reports for the commit e8dabf0 to get more accurate results

@@           Coverage Diff           @@
##           master    #2468   +/-   ##
=======================================
  Coverage   72.66%   72.66%           
=======================================
  Files          78       78           
  Lines        3669     3669           
  Branches       58       58           
=======================================
  Hits         2666     2666           
  Misses        999      999           
  Partials        4        4           

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@sekyondaMeta sekyondaMeta requested review from msaroufim and lxning July 17, 2023 17:38

To reduce model latency we recommend:
* Pre-installing the model parallel library such as Deepspeed on the container or host.
* Pre-downloading the model checkpoints. For example, if using HuggingFace, a pretrained model can be pre-downloaded via [Download_model.py](https://github.com/pytorch/serve/blob/75f66dc557b3b67a3ab56536a37d7aa21582cc04/examples/large_models/deepspeed/opt/Readme.md?plain=1#L7)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for link can we please link to section instead of line?

#### Tune torchrun parameters
User is able to tune torchrun parameters in [model config YAML file](https://github.com/pytorch/serve/blob/2f1f52f553e83703b5c380c2570a36708ee5cafa/model-archiver/README.md?plain=1#L179). The supported parameters are defined at [here](https://github.com/pytorch/serve/blob/2f1f52f553e83703b5c380c2570a36708ee5cafa/frontend/archive/src/main/java/org/pytorch/serve/archive/model/ModelConfig.java#L329). For example, by default, `OMP_NUMBER_THREADS` is 1. It can be modified in the YAML file.
You can tune the model config YAML file to get better performance in the following ways:
* Update the [responseTimeout](https://github.com/pytorch/serve/blob/5ee02e4f050c9b349025d87405b246e970ee710b/docs/configuration.md?plain=1#L216) if high model loading or inference latency causes response timeout.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

long model loading?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

User is able to tune torchrun parameters in [model config YAML file](https://github.com/pytorch/serve/blob/2f1f52f553e83703b5c380c2570a36708ee5cafa/model-archiver/README.md?plain=1#L179). The supported parameters are defined at [here](https://github.com/pytorch/serve/blob/2f1f52f553e83703b5c380c2570a36708ee5cafa/frontend/archive/src/main/java/org/pytorch/serve/archive/model/ModelConfig.java#L329). For example, by default, `OMP_NUMBER_THREADS` is 1. It can be modified in the YAML file.
You can tune the model config YAML file to get better performance in the following ways:
* Update the [responseTimeout](https://github.com/pytorch/serve/blob/5ee02e4f050c9b349025d87405b246e970ee710b/docs/configuration.md?plain=1#L216) if high model loading or inference latency causes response timeout.
* Tune the [torchrun parameters](https://github.com/pytorch/serve/blob/2f1f52f553e83703b5c380c2570a36708ee5cafa/model-archiver/README.md?plain=1#L179). The supported parameters are defined at [here](https://github.com/pytorch/serve/blob/2f1f52f553e83703b5c380c2570a36708ee5cafa/frontend/archive/src/main/java/org/pytorch/serve/archive/model/ModelConfig.java#L329). For example, by default, `OMP_NUMBER_THREADS` is 1. This can be modified in the YAML file.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make it clearer that torchrun is a job launcher

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

return ["hello world "]
```
Client side receives the chunked data.
```python
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the import statements for test_utils here so this code can actually run

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

if type(data) is list:
for i in range (3):
send_intermediate_predict_response(["intermediate_response"], context.request_ids, "Intermediate Prediction success", 200, context)
return ["hello world "]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this line needed? confusing why intermediate responses and a handle response both exist

if chunk:
prediction.append(chunk.decode("utf-8"))

assert str(" ".join(prediction)) == "hello hello hello hello world "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this example is confusing, can we have some plausible response back with a real model? The echo_stream model doesnt make it clear why this feature is useful


#### GRPC Server Side Streaming

TorchServe [GRPC API](grpc_api.md) adds server side streaming of the inference API "StreamPredictions" to allow a sequence of inference responses to be sent over the same GRPC stream. This API is only recommended for use case when the inference latency of the full response is high and the inference intermediate results are sent to client. An example could be LLMs for generative applications, where generating "n" number of tokens can have high latency, in this case the user can receive each generated token once ready until the full response completes. This API automatically forces the batchSize to be one.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading the justification here I'm not sure i follow what value this adds on top of the http response streaming work

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the idea was supposed to be if you are already using (or already decided to use) one instead of the other . That said, I do see what you mean, will think of a better way o phrase this.

@msaroufim msaroufim self-requested a review July 23, 2023 18:05
#### Tune torchrun parameters
User is able to tune torchrun parameters in [model config YAML file](https://github.com/pytorch/serve/blob/2f1f52f553e83703b5c380c2570a36708ee5cafa/model-archiver/README.md?plain=1#L179). The supported parameters are defined at [here](https://github.com/pytorch/serve/blob/2f1f52f553e83703b5c380c2570a36708ee5cafa/frontend/archive/src/main/java/org/pytorch/serve/archive/model/ModelConfig.java#L329). For example, by default, `OMP_NUMBER_THREADS` is 1. It can be modified in the YAML file.
You can tune the model config YAML file to get better performance in the following ways:
* Update the [responseTimeout](https://github.com/pytorch/serve/blob/5ee02e4f050c9b349025d87405b246e970ee710b/docs/configuration.md?plain=1#L216) if high model loading or inference latency causes response timeout.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@msaroufim msaroufim merged commit 61f1c41 into pytorch:master Jul 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants