
FEAT: optimized huggingface model support #354

Merged

Conversation

KutalVolkan
Contributor

@KutalVolkan KutalVolkan commented Aug 31, 2024

Overview

The goal of this task is to reintroduce support for HuggingFace models within the PyRIT framework after the original HuggingFaceChat was removed due to long model loading time. The aim is to enable support for lightweight models that can run efficiently on a normal machine, while also providing flexibility for more complex setups (e.g., cloud-based inference for larger models).

Work Completed

  1. Reintroducing HuggingFaceChat Class:

    • Refactored the existing HuggingFaceChat class to create an optimized HuggingFaceChatTarget class that aligns with the current PromptTarget design standards (a minimal sketch of the target's shape follows this list).
  2. Test Suite for HuggingFaceChatTarget:

    • Refactored the existing test suite to align with the newly optimized HuggingFaceChatTarget class structure.
  3. Jupytext Integration and Notebook Creation:

    • Created a .py file that serves as both a Python script and a Jupyter notebook using Jupytext, demonstrating the HuggingFaceChatTarget usage.
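
For a concrete picture, here is a minimal sketch of the shape such a target can take (class and method names are illustrative only, not the exact implementation in this PR):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch only: a bare-bones local chat target around a small model.
class SimpleHuggingFaceChatTarget:
    def __init__(self, model_id: str = "microsoft/Phi-3-mini-4k-instruct") -> None:
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model = AutoModelForCausalLM.from_pretrained(model_id)

    def send_prompt(self, prompt: str, max_new_tokens: int = 100) -> str:
        # Format the prompt with the model's chat template and generate a reply.
        messages = [{"role": "user", "content": prompt}]
        input_ids = self.tokenizer.apply_chat_template(
            messages, add_generation_prompt=True, return_tensors="pt"
        )
        output_ids = self.model.generate(input_ids, max_new_tokens=max_new_tokens)
        # Decode only the newly generated tokens, i.e. the assistant's reply.
        return self.tokenizer.decode(
            output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True
        )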

Next Steps

  1. Optimize Model Loading for Faster Execution:

    • Investigate smaller or optimized models that load quickly, such as Phi3 or similar lightweight models.
  2. Refactor and Improve Message Parsing:

    • Simplify the logic for detecting specific strings (e.g., "ASSISTANT") and parsing responses. Evaluate new libraries or approaches that can make this process more robust and efficient (see the sketch after this list).
  3. Enhance Test Coverage and Mocking:

    • If needed, expand the test suite to cover more edge cases and error scenarios.
  4. Future Support for Cloud-Based HuggingFace Models:

    • Consider adding a HuggingFace target capable of interacting with cloud-hosted endpoints for large models, similar to PyRIT's Azure ML Target.
    • Current work has focused on optimizing and refactoring the HuggingFaceChatTarget for local model usage, but cloud integration remains a potential future enhancement.
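
On item 2 above, one option (a sketch, assuming generation goes through the tokenizer's chat template) is to avoid searching for marker strings like "ASSISTANT" and instead decode only the tokens generated after the prompt:

def extract_assistant_reply(tokenizer, input_ids, output_ids) -> str:
    # Everything after the prompt tokens is the model's (assistant's) response.
    new_token_ids = output_ids[0][input_ids.shape[-1]:]
    return tokenizer.decode(new_token_ids, skip_special_tokens=True)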

Action Items

  • Reintroduce HuggingFaceChatTarget class for dynamic HuggingFace model support.
  • Add tests to cover initialization, model loading, and response generation with efficient mocking strategies.
  • Integrate a Jupyter notebook using Jupytext to demonstrate the new target.
  • Optimize model loading with lightweight models.
  • Refactor message parsing logic for better maintainability.
  • Improve test suite with enhanced mocking and edge case coverage.
  • Develop support for cloud-based HuggingFace models and endpoints.
  • Update and align notebooks with the rest of the PyRIT repository.

Contributors and Feedback

This work is based on suggestions from @romanlutz, who outlined potential areas of improvement and future work. The actual code was reused and built upon from @rdheekonda and @rlundeen2's prior implementations.


Related Issue: #347.

@KutalVolkan
Contributor Author

Hi @romanlutz ,

I wanted to get your thoughts on the following suggestion: separating concerns by moving all model-related management (such as loading with different precisions, quantization, unloading, and saving models) into a dedicated utility submodule (e.g., model_management/model_utils.py).

Key Benefits of This Approach:

  1. Inference Time Optimization: By managing model precision (e.g., fp16, bf16, int8) and quantization separately, we can dynamically adjust model loading based on available hardware and the desired trade-offs between performance and memory usage. For instance, quantizing a model from fp32 to int8 can significantly reduce the memory footprint and speed up inference, which is crucial for faster response times (see the sketch after this list).

  2. Separation of Concerns: Centralizing model loading, precision handling, quantization, unloading, and saving in a utility module keeps HuggingFaceChatTarget focused on its core task—managing prompt interactions—while offloading model-related complexities to a dedicated area.

  3. Maintainability and Reusability: This approach would make our code more maintainable and reusable. The utility functions can be leveraged by other targets or components, which is particularly useful if we decide to support more models or different configurations.

  4. Flexibility for Use Cases: The proposed structure allows for flexibility in handling specific use cases, such as changing a model’s precision on the fly based on user requirements (e.g., switching from fp32 to int8 for lighter setups) or optimizing for GPU vs. CPU inference scenarios.
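
To illustrate item 1, here is a rough sketch of what such a utility could expose (module layout, function name, and the bitsandbytes-based int8 path are assumptions, not a final design):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Hypothetical helper for model_management/model_utils.py: pick precision at load time.
def load_model(model_id: str, precision: str = "fp16"):
    if precision == "int8":
        # 8-bit loading via bitsandbytes; requires a CUDA-capable GPU.
        return AutoModelForCausalLM.from_pretrained(
            model_id, quantization_config=BitsAndBytesConfig(load_in_8bit=True)
        )
    dtype = {"fp32": torch.float32, "fp16": torch.float16, "bf16": torch.bfloat16}[precision]
    return AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype)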

I would love to hear your thoughts on whether this aligns with our scope and if there are any additional considerations to keep in mind. In the meantime, I will work on refactoring the message parsing logic and adding support for cloud-based HuggingFace models.

@romanlutz
Contributor

Absolutely! In fact, you've thought about it a lot more thoroughly than I had so far 🙂 I guess I'd do this in chunks rather than one massive PR but I'm open to options. Wdyt?

@rdheekonda and @rlundeen2 have faced issues with the old target and may have thoughts, too.

I'm on the fence about creating a new module for this or just a folder (sub module?) within targets or common. I guess it doesn't really matter and users shouldn't have to interact with it so we should be able to move at a later point if necessary.

@romanlutz romanlutz changed the title [DRAFT] Feat/optimized huggingface model support [DRAFT] FEAT: optimized huggingface model support Sep 2, 2024
@KutalVolkan
Contributor Author

Absolutely! In fact, you've thought about it a lot more thoroughly than I had so far 🙂 I guess I'd do this in chunks rather than one massive PR but I'm open to options. Wdyt?

Hi Roman,

Thanks for the feedback! I agree with breaking this into smaller chunks rather than one big PR.

@KutalVolkan
Contributor Author

KutalVolkan commented Sep 4, 2024

Hello @romanlutz,

I made some quick changes by adding HuggingFaceEndpointTarget. The endpoint is more or less working, but I’m thinking of removing complete_chat—what are your thoughts on that? It’s a bit rough due to fast prototyping, but I’ll clean it all up in the next two days.

The local version of HuggingFaceChatTarget isn’t finished yet and will definitely take longer. It’s exciting to try optimizing it, but I'm currently running it on GPU (GeForce RTX 4080), and it takes about 1 minute for microsoft/Phi-3-mini-128k-instruct, so I may not have configured something correctly.

Feedback and ideas are always welcome!

Thanks for your patience :)

@romanlutz
Contributor

Yes, whatever makes sense. A lot of internal methods there may be "legacy" and can be updated.

@rdheekonda mentioned that we had issues with huge delays while pulling the models from HF and running locally. Even small models took a while. We don't want it to be prohibitively difficult to run the notebooks, for example, so that's something that needs solving....

For tests, that part could probably be mocked so that there's no network call.

@KutalVolkan
Contributor Author

KutalVolkan commented Sep 4, 2024

Hello @romanlutz ,

Here is the workflow I will follow for now:

1. Profile the Model Loading Workflow

I will use cProfile and other timing and profiling tools to analyze the entire model loading workflow, covering:

  • Internet to Disk: Measure the time it takes to download models from Hugging Face to the local disk. This will help understand network-related delays.
  • Disk to Memory: Measure the time taken to load models from disk into memory.
  • Optional Model Initialization and Inference: Assess the initialization time and inference speed, which are important for understanding how models perform under different conditions (e.g., CPU vs. GPU).
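
As a sketch of the disk-to-memory measurement (the model id is just an example):

import cProfile
import pstats

from transformers import AutoModelForCausalLM

profiler = cProfile.Profile()
profiler.enable()
AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
profiler.disable()

# Show the slowest calls by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)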

2. Experiment with Class-Level Caching

To optimize the model loading process, I will implement class-level caching. This caching mechanism will store the model and tokenizer in memory after the first load, hopefully allowing subsequent instances of HuggingFaceChatTarget to reuse the cached objects instead of reloading them from disk. This could help reduce redundant loading times. I will test the impact of this approach to determine its effectiveness in practice.
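
A minimal sketch of the caching idea (illustrative only, not the final implementation):

from transformers import AutoModelForCausalLM, AutoTokenizer

class CachedModelLoader:
    # Class-level cache shared by all instances.
    _model = None
    _tokenizer = None
    _model_id = None

    @classmethod
    def load(cls, model_id: str):
        if cls._model is None or cls._model_id != model_id:
            cls._tokenizer = AutoTokenizer.from_pretrained(model_id)
            cls._model = AutoModelForCausalLM.from_pretrained(model_id)
            cls._model_id = model_id
        return cls._model, cls._tokenizer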

3. Extend and Enhance Test Coverage

While profiling will give us performance data, I plan, as you suggested, to enhance the test coverage to ensure robustness and proper error handling. This includes:

  • Mocking Internet-to-Disk and Disk-to-Memory Phases: To simulate and test different conditions without actual downloads, such as slow network conditions, network unavailability, or disk I/O issues.
  • Error Handling for Network Issues: Mocking scenarios where network errors occur to validate that our code handles these gracefully.
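
For example, a test could patch the loaders so no real network or disk access happens (patch targets here are assumptions for illustration):

from unittest.mock import MagicMock, patch

@patch("transformers.AutoTokenizer.from_pretrained", return_value=MagicMock())
@patch("transformers.AutoModelForCausalLM.from_pretrained", return_value=MagicMock())
def test_load_without_network(mock_model, mock_tokenizer):
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("some/model")
    model = AutoModelForCausalLM.from_pretrained("some/model")
    # Both loaders were intercepted, so no download or disk read occurred.
    assert mock_tokenizer.called and mock_model.called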

4. Optimize Downloads with huggingface-cli

I will use huggingface-cli with HF_HUB_ENABLE_HF_TRANSFER=1 to enable parallel downloads and speed up model fetching. I will integrate this as subprocess calls in the existing common folder and later call the method from the HuggingFaceChatTarget class. I am also considering aria2 as a fallback if more control is needed.
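
Roughly, the subprocess wrapper could look like this (a sketch; the helper name and file list are placeholders, and the hf_transfer package must be installed for the flag to take effect):

import os
import subprocess

def download_files_with_cli(model_id: str, files: list[str]) -> None:
    # Enable parallel transfers via the hf_transfer backend.
    env = {**os.environ, "HF_HUB_ENABLE_HF_TRANSFER": "1"}
    subprocess.run(["huggingface-cli", "download", model_id, *files], env=env, check=True)

download_files_with_cli("microsoft/Phi-3-mini-4k-instruct", ["config.json", "tokenizer.json"])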

Let me know if there are any specific areas you’d like me to focus on.

@romanlutz
Contributor

All these make sense!

Good point about caching. If the download is a one-time operation of a few minutes it's perhaps fine. If it takes 10 minutes every time that's not great 😄

… Update argument type in hugging_face_enpoint_target.py
- Modify examples in use_hugginface_chat_target_endpoint.py
- Update .env_example
- Add new notebook use_hugginface_chat_target_endpoint.ipynb
@KutalVolkan
Contributor Author

Hello @romanlutz,

You can run the notebook use_huggingface_chat_target.ipynb in Anaconda Prompt as Administrator to avoid permission issues and improve download efficiency.

My Observations:

  • Without Admin: It took around 529 seconds due to redundant file handling.
  • With Admin: It only took 0.66 seconds for downloading specific files and about 52 seconds for loading the model.

Note: The 0.66-second download time indicates that only specific files were downloaded, but I may not have provided all the necessary files. With all required files specified, the entire process could potentially be reduced to under 1 minute.

To optimize the process, be sure to provide the necessary_files list to avoid downloading the entire model and to speed up the setup.

There may still be additional bottlenecks, and the code is a bit messy :), but the main problem with downloading should be solved by running as admin.
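
For reference, the necessary_files idea can also be expressed with huggingface_hub's allow_patterns (a sketch, not necessarily the helper used in this branch):

from huggingface_hub import snapshot_download

# Only fetch the files needed to run the model, instead of the whole repository.
necessary_files = [
    "config.json",
    "generation_config.json",
    "tokenizer.json",
    "tokenizer.model",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "model-*.safetensors",
]
snapshot_download(repo_id="microsoft/Phi-3-mini-4k-instruct", allow_patterns=necessary_files)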

Performance Data Snippet:

Error during table creation: (duckdb.duckdb.IOException) IO Error: File is already open in 
C:\Users\vkuta\anaconda3\envs\pyrit-dev\python.exe (PID 22208)
(Background on this error at: https://sqlalche.me/e/20/e3q8)
Model microsoft/Phi-3-mini-4k-instruct not fully found in cache. Downloading missing files using CLI...
Please make sure you are logged in to Hugging Face CLI. Running login process...
Token is valid (permission: write).
Your token has been saved in your configured git credential helpers (manager).
Your token has been saved to C:\Users\vkuta\.cache\huggingface\token
Login successful

Successfully logged into Hugging Face.
The following files will be downloaded:
  - config.json
  - generation_config.json
  - model-00001-of-00002.safetensors
  - model-00002-of-00002.safetensors
  - special_tokens_map.json
  - tokenizer.json
  - tokenizer.model
  - tokenizer_config.json
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Downloaded files:
Specified files for microsoft/Phi-3-mini-4k-instruct downloaded successfully to the cache.
Download of specific files completed in 0.66 seconds.
Loading checkpoint shards: 100%
 2/2 [00:49<00:00, 25.32s/it]
         1097875 function calls (1044540 primitive calls) in 52.512 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000   52.511   26.256 C:\Users\vkuta\anaconda3\envs\pyrit-dev\Lib\site-packages\IPython\core\interactiveshell.py:3541(run_code)
   1008/2    0.021    0.000   52.511   26.256 {built-in method builtins.exec}
        1    0.000    0.000   52.511   52.511 C:\Users\vkuta\projects\PyRIT\pyrit\prompt_target\hugging_face_chat_target.py:28(__init__)
        1    0.000    0.000   52.286   52.286 C:\Users\vkuta\projects\PyRIT\pyrit\prompt_target\hugging_face_chat_target.py:68(load_model_and_tokenizer)
        1    0.000    0.000   50.149   50.149 C:\Users\vkuta\anaconda3\envs\pyrit-dev\Lib\site-packages\transformers\models\auto\auto_factory.py:444(from_pretrained)
        1    0.003    0.003   49.382   49.382 C:\Users\vkuta\anaconda3\envs\pyrit-dev\Lib\site-packages\transformers\modeling_utils.py:2746(from_pretrained)
        1    0.142    0.142   49.162   49.162 C:\Users\vkuta\anaconda3\envs\pyrit-dev\Lib\site-packages\transformers\modeling_utils.py:3846(_load_pretrained_model)
        2    0.005    0.003   46.857   23.428 C:\Users\vkuta\anaconda3\envs\pyrit-dev\Lib\site-packages\transformers\modeling_utils.py:650(_load_state_dict_into_model)
    846/2    0.035    0.000   46.852   23.426 C:\Users\vkuta\anaconda3\envs\pyrit-dev\Lib\site-packages\transformers\modeling_utils.py:676(load)
      299    0.089    0.000   46.779    0.156 C:\Users\vkuta\anaconda3\envs\pyrit-dev\Lib\site-packages\torch\nn\modules\module.py:1996(_load_from_state_dict)
      195   46.614    0.239   46.614    0.239 {method 'copy_' of 'torch._C.TensorBase' objects}
        2    2.017    1.008    2.035    1.017 {built-in method gc.collect}
        1    0.000    0.000    1.953    1.953 C:\Users\vkuta\projects\PyRIT\pyrit\common\huggingface_model_manager.py:90(download_specific_files_with_cli)
        2    0.000    0.000    1.725    0.862 C:\Users\vkuta\anaconda3\envs\pyrit-dev\Lib\subprocess.py:506(run)
       98    1.701    0.017    1.701    0.017 {method 'acquire' of '_thread.lock' objects}
        2    0.000    0.000    1.699    0.849 C:\Users\vkuta\anaconda3\envs\pyrit-dev\Lib\subprocess.py:1165(communicate)
        2    0.000    0.000    1.699    0.849 C:\Users\vkuta\anaconda3\envs\pyrit-dev\Lib\subprocess.py:1603(_communicate)
        4    0.000    0.000    1.696    0.424 C:\Users\vkuta\anaconda3\envs\pyrit-dev\Lib\threading.py:1087(join)
       44    0.000    0.000    1.696    0.039 C:\Users\vkuta\anaconda3\envs\pyrit-dev\Lib\threading.py:1125(_wait_for_tstate_lock)
        1    0.000    0.000    1.062    1.062 C:\Users\vkuta\projects\PyRIT\pyrit\common\huggingface_model_manager.py:16(login_to_huggingface)
  871/180    0.004    0.000    0.740    0.004 <frozen importlib._bootstrap>:1165(_find_and_load)
  203/178    0.000    0.000    0.740    0.004 C:\Users\vkuta\anaconda3\envs\pyrit-dev\Lib\importlib\__init__.py:108(import_module)
  203/178    0.000    0.000    0.739    0.004 <frozen importlib._bootstrap>:1192(_gcd_import)
   755/83    0.003    0.000    0.738    0.009 <frozen importlib._bootstrap>:1120(_find_and_load_unlocked)
   690/82    0.002    0.000    0.721    0.009 <frozen importlib._bootstrap>:666(_load_unlocked)
   686/82    0.001    0.000    0.719    0.009 <frozen importlib._bootstrap_external>:934(exec_module)
    32/16    0.000    0.000    0.705    0.044 C:\Users\vkuta\anaconda3\envs\pyrit-dev\Lib\site-packages\huggingface_hub\utils\_validators.py:98(_inner_fn)
 1551/165    0.001    0.000    0.692    0.004 <frozen importlib._bootstrap>:233(_call_with_frames_removed)
        4    0.000    0.000    0.684    0.171 C:\Users\vkuta\anaconda3\envs\pyrit-dev\Lib\site-packages\requests\sessions.py:500(request)
        4    0.000    0.000    0.674    0.169 C:\Users\vkuta\anaconda3\envs\pyrit-dev\Lib\site-packages\requests\sessions.py:673(send)
        4    0.000    0.000    0.673    0.168 C:\Users\vkuta\anaconda3\envs\pyrit-dev\Lib\site-packages\huggingface_hub\utils\_http.py:63(send)
        4    0.000    0.000    0.673    0.168 C:\Users\vkuta\anaconda3\envs\pyrit-dev\Lib\site-packages\requests\adapters.py:613(send)
        4    0.000    0.000    0.670    0.168 C:\Users\vkuta\anaconda3\envs\pyrit-dev\Lib\site-packages\urllib3\connectionpool.py:594(urlopen)
        4    0.000    0.000    0.669    0.167 C:\Users\vkuta\anaconda3\envs\pyrit-dev\Lib\site-packages\urllib3\connectionpool.py:379(_make_request)
27857/6489    0.007    0.000    0.604    0.000 {built-in method builtins.hasattr}
   117/94    0.000    0.000    0.603    0.006 C:\Users\vkuta\anaconda3\envs\pyrit-de
   ....

@romanlutz
Contributor

Woah! That's a massive difference. I'll try this out and will report my findings. Might take me until tomorrow, though.

My first thought was that perhaps it did cache. The difference is so huge.

@KutalVolkan
Contributor Author

KutalVolkan commented Sep 5, 2024

Woah! That's a massive difference. I'll try this out and will report my findings. Might take me until tomorrow, though.

My first thought was that perhaps it did cache. The difference is so huge.

Unfortunately, you are right; it was cached. I deleted the cache and ran it again, and it downloaded model-00001-of-00002.safetensors, etc., again even though they were downloaded before.

I will play around with this tomorrow; it may be connected to some path issue.


Note: After disabling my VPN 😂, my download speed improved to 381.05 Mbps (~47.6 MB/s), reducing the download time to 2 min 42 sec for the Phi-3-Mini-4K-Instruct (3.8B parameters) model. With a faster connection, it could be even quicker. Class-level caching is not yet implemented, so it should take around 3 minutes, but only on the first run.


…
- Cleaned up code and improved organization for model download scripts using Hugging Face Hub, aria2, and Transformers libraries.
- Added pytest cases to test different model download variants, ensuring they handle caching and folder creation properly.
- Updated documentation notebook to reflect the latest changes and cleaned up unnecessary content.
@KutalVolkan
Contributor Author

KutalVolkan commented Sep 7, 2024

Hello @romanlutz ,

I’ve been employing different methods to download the Phi-3-Mini-4K-Instruct model. Here are the details of the setups and their performance:

  1. Hugging Face CLI with Parallel Downloads:

    • Setting: Enabled parallel downloads by setting HF_HUB_ENABLE_HF_TRANSFER=1.
    • model-00001-of-00002.safetensors:
      • File Size: 4.97 GB
      • Download Time: 101 seconds
      • Calculated Speed: Approximately 50.43 MB/s
    • model-00002-of-00002.safetensors:
      • File Size: 2.67 GB
      • Download Time: 52 seconds
      • Calculated Speed: Approximately 52.58 MB/s
  2. Aria2 Setup:

    • I also experimented with aria2c to see if there could be an improvement in download speeds using its capabilities to handle multiple connections. However, the speeds observed were similar to those achieved with the Hugging Face CLI setup.
  3. Default Setup:

    • Additionally, I tested the standard method provided by the transformers library. While I had previously set the HF_HUB_ENABLE_HF_TRANSFER=1 environment variable for other tests, it is unclear whether this setting automatically influences the default download process in the Transformers library. Despite this uncertainty, the observed download speeds were similar to those achieved with methods where parallel downloads were explicitly enabled.

My best download time was 2 min and 14 sec, with another instance taking 2 min and 42 sec. These speeds can vary based on the time of day, such as early morning or late evening, depending on how heavily the servers are loaded.

Despite these diverse approaches, my results have been quite consistent across the board. My internet connection offers up to 381.05 Mbps (~47.6 MB/s), and at times the download speeds slightly exceed this limit, possibly due to ISP speed bursts. This suggests that the HF CLI setup with parallel downloads enabled is optimal among the methods I've tested, efficiently utilizing my internet capacity.

I am now shifting my focus to work on the inference part and will revisit the download strategies later.

It would be helpful to hear about your experiences or from other team members (@rdheekonda, @rlundeen2), particularly any download speeds or feedback you could share. If you've conducted any tests or have insights into optimizing these downloads further, your input would be invaluable.


For more information about optimizing download speeds using the Hugging Face CLI and enabling hf_transfer for faster transfers, check out the environment variables guide and the CLI download guide.

To calculate the download speed, I used the formula:

Speed (MB/s) = File Size (MB) / Download Time (seconds)

For example:

  • File Size: 4.97 GB = 4.97 * 1024 MB ≈ 5089.28 MB
  • Download Time: 101 seconds
Speed = 5089.28 MB / 101 s ≈ 50.4 MB/s
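
Or, as a quick check in Python:

file_size_mb = 4.97 * 1024               # ≈ 5089.28 MB
download_time_s = 101
print(round(file_size_mb / download_time_s, 1))   # ≈ 50.4 MB/s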

Best,
Volkan 😄

@rdheekonda
Contributor

Thank you for the details, @KutalVolkan. While downloading is one aspect of PyRIT demo testing, HuggingFace typically checks the default cache directory, and since we'll be using the same model for demo testing, it should already exist.

However, our main focus for this feature is on speeding up inference: we send a few text prompts and observe the responses on the CPU-based compute where we run the demos before creating a PR, to ensure existing functionality is not broken by new features. Could you test with a few smaller models and check the inference time using the PromptSendingOrchestrator?

The models don’t have to be from Microsoft, as HuggingFace hosts models from various companies. Here’s another 135M model with a 269MB size, which I believe will have faster inference compared to phi-3-4k-instruct.

Again, the goal is to optimize the notebook demo with a small yet effective instruction model so that users can experiment with different HuggingFace models.

@KutalVolkan
Contributor Author

KutalVolkan commented Sep 10, 2024

Thank you for the details, @KutalVolkan. While downloading is one aspect of PyRIT demo testing, HuggingFace typically checks the default cache directory, and since we'll be using the same model for demo testing, it should already exist.

However, our main focus for this feature is on speeding up inference: we send a few text prompts and observe the responses on the CPU-based compute where we run the demos before creating a PR, to ensure existing functionality is not broken by new features. Could you test with a few smaller models and check the inference time using the PromptSendingOrchestrator?

The models don’t have to be from Microsoft, as HuggingFace hosts models from various companies. Here’s another 135M model with a 269MB size, which I believe will have faster inference compared to phi-3-4k-instruct.

Again, the goal is to optimize the notebook demo with a small yet effective instruction model so that users can experiment with different HuggingFace models.

Hello @rdheekonda ,

Thank you for the feedback!

I have tested the inference times for two smaller models using the PromptSendingOrchestrator on CPU-based compute. Here are the results:

  1. Model: microsoft/Phi-3-mini-4k-instruct

    • Prompts Used: ["What is 3x3?", "What is 4x4?", "What is 5x5?", "What is 6x6?", "What is 7x7?"]
    • Inference Time: Approximately 25.7 seconds
    • Average Inference Time per Prompt: Approximately 5.14 seconds
  2. Model: HuggingFaceTB/SmolLM-135M-Instruct

    • Prompts Used: ["What is 3x3?", "What is 4x4?", "What is 5x5?", "What is 6x6?", "What is 7x7?"]
    • Inference Time: Approximately 12.8 seconds
    • Average Inference Time per Prompt: Approximately 2.56 seconds

Both models were run on CPU, but our current implementation also supports running on GPU.

Note: Increasing the max_new_tokens parameter leads to an increase in inference time. This is something to consider when optimizing for speed in different scenarios.
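
A sketch of how such a timing run can be set up (constructor and parameter names are assumptions based on the PyRIT API at the time, not the exact code used):

import asyncio
import time

from pyrit.orchestrator import PromptSendingOrchestrator
from pyrit.prompt_target import HuggingFaceChatTarget

prompts = ["What is 3x3?", "What is 4x4?", "What is 5x5?", "What is 6x6?", "What is 7x7?"]

async def time_prompts():
    # Parameter names below (model_id, use_cuda, prompt_target) are assumptions.
    target = HuggingFaceChatTarget(model_id="HuggingFaceTB/SmolLM-135M-Instruct", use_cuda=False)
    orchestrator = PromptSendingOrchestrator(prompt_target=target)
    start = time.time()
    await orchestrator.send_prompts_async(prompt_list=prompts)
    elapsed = time.time() - start
    print(f"Total: {elapsed:.1f}s, average per prompt: {elapsed / len(prompts):.2f}s")

asyncio.run(time_prompts())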

I'll test a few other models afterwards while cleaning up the code and considering ways to extract only the assistant's response.

Please let me know if there's anything else you'd like me to test or if any further adjustments are needed from my side.

@KutalVolkan
Contributor Author

Hello @rdheekonda,

Thank you for your feedback, I really appreciate it! Please feel free to review again and suggest any new enhancements if needed.

Regarding the model for the demo, HuggingFaceTB/SmolLM-135M-Instruct, I am not satisfied with its output. It doesn’t seem to answer the prompt correctly, even when I set max_new_tokens to 100. The model I’ve found to work best is microsoft/Phi-3-mini-4k-instruct.

Perhaps the demo is intended primarily for inference time testing.

@rdheekonda
Contributor

Hello @rdheekonda,

Thank you for your feedback, I really appreciate it! Please feel free to review again and suggest any new enhancements if needed.

Regarding the model for the demo, HuggingFaceTB/SmolLM-135M-Instruct, I am not satisfied with its output. It doesn’t seem to answer the prompt correctly, even when I set max_new_tokens to 100. The model I’ve found to work best is microsoft/Phi-3-mini-4k-instruct.

Perhaps the demo is intended primarily for inference time testing.

Thanks, @KutalVolkan, for addressing the comments. I’ve reviewed and added a few more comments. Let me know once they’re resolved, and it will be good to merge.

@KutalVolkan
Contributor Author

KutalVolkan commented Oct 1, 2024

Thanks, @KutalVolkan, for addressing the comments. I’ve reviewed and added a few more comments. Let me know once they’re resolved, and it will be good to merge.

Hello @rdheekonda,

Thanks for the additional feedback! I've addressed the new comments. Please review and let me know if anything else is needed.

Contributor

@rdheekonda rdheekonda left a comment


Looks great, few things before merging

  1. One comment needs to be addressed.
  2. Merge onto main

@KutalVolkan
Contributor Author

Looks great, few things before merging

  1. One comment needs to be addressed.
  2. Merge onto main

Hello Raja,

I hope I have made the changes you requested. If anything is still not quite right, I might need a bit more clarification. Please let me know if any further adjustments are needed.

Thank you for your support :)

@rdheekonda
Contributor

Looks great, few things before merging

  1. One comment needs to be addressed.
  2. Merge onto main

Hello Raja,

I hope I have made the changes you requested. If anything is still not quite right, I might need a bit more clarification. Please let me know if any further adjustments are needed.

Thank you for your support :)

It's been addressed in the latest commit 44b9251
Thanks.

@rdheekonda
Contributor

Upon running the build, unit tests are failing with a "torch module not found" error in the environment. Could you add the necessary dependencies to pyproject.toml so that we can retry the build?

@KutalVolkan
Contributor Author

Upon running the build, unit tests are failing with a "torch module not found" error in the environment. Could you add the necessary dependencies to pyproject.toml so that we can retry the build?

Hello Raja,

The dependency issue has been addressed. Thank you for pointing it out. If anything else needs to be done, I'm happy to make further adjustments.

@romanlutz
Contributor

I don't see any changes to pyproject.toml. Am I missing something?

@KutalVolkan
Contributor Author

I don't see any changes to pyproject.toml. Am I missing something?

Is what I did correct? I thought I addressed it here: 14f6a3f

@KutalVolkan
Contributor Author

KutalVolkan commented Oct 3, 2024

Hello Roman & Raja,

I see the error regarding the Hugging Face token. I will handle that by tomorrow by mocking the token.

@rdheekonda
Contributor

Hello Roman & Raja,

I see the error regarding the Hugging Face token. I will handle that by tomorrow by mocking the token.

Thanks, @KutalVolkan. Here's the reference code for how we mocked the environment variables in our pytests.

@KutalVolkan
Contributor Author

Hello @romanlutz & @rdheekonda,

I’ve updated the tests for HuggingFaceChatTarget to mock get_required_value. I’ve also added a test to ensure a ValueError is raised when the token is missing, thanks to Raja’s code snippet reference.
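
For reference, a sketch of what that test can look like (the patch target path and constructor arguments are assumptions, not the exact test code):

from unittest.mock import patch

import pytest

from pyrit.prompt_target import HuggingFaceChatTarget

def test_missing_token_raises_value_error():
    # Simulate a missing token by having get_required_value fail; the module path
    # patched here is an assumption about where the target looks it up.
    with patch(
        "pyrit.prompt_target.hugging_face_chat_target.default_values.get_required_value",
        side_effect=ValueError("Environment variable HUGGINGFACE_TOKEN is required"),
    ):
        with pytest.raises(ValueError):
            HuggingFaceChatTarget(model_id="HuggingFaceTB/SmolLM-135M-Instruct")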

Fingers crossed this works as expected! 😊

@romanlutz
Contributor

I don't see any changes to pyproject.toml. Am I missing something?

Is what I did correct? I thought i addressed it here 14f6a3f

Somehow it wasn't showing before. This is fine.

@rdheekonda rdheekonda changed the title [DRAFT] FEAT: optimized huggingface model support FEAT: optimized huggingface model support Oct 6, 2024
@rdheekonda rdheekonda merged commit 4eed35d into Azure:main Oct 6, 2024
5 checks passed
@rdheekonda
Contributor

Merged the PR. Excellent work, @KutalVolkan !

@KutalVolkan
Contributor Author

KutalVolkan commented Oct 6, 2024

Merged the PR. Excellent work, @KutalVolkan !

Thank you all, for your support!

@KutalVolkan KutalVolkan deleted the feat/optimized-huggingface-model-support branch October 7, 2024 05:27