✨ Short term accelerated compute instance #4120
Comments
@bcrawford-moj do you have a sense of the resource requirements at this time? E.g. do you have an estimate of the file size of the input data being processed by the LLM?
Hi @AntFMoJ, I'm the one working with the model. I'm not too familiar with how the infrastructure/resource provision works, but I imagine we just need something similar to what's available on the AP in terms of CPU and RAM, but with a GPU with 16GB, or even 8GB, of VRAM. Reasoning: the entire model already runs on the AP by splitting the data into smaller chunks. The LLM/transformer component is a fairly small model (44M parameters) which should fit comfortably in 8GB of VRAM. The inputs are only 128 tokens (~words) long, and we will be able to scale the model to the amount of VRAM available. Thanks for your help, and please let me know if something doesn't make sense :)
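For a rough sense of scale, a back-of-the-envelope sketch of why a 44M-parameter model at fp16 sits comfortably within 8GB of VRAM (the hidden size, depth and batch size below are illustrative guesses, not the actual model configuration):

```python
# Back-of-the-envelope VRAM estimate for a 44M-parameter transformer at inference time.
# Hidden size, depth and batch size are illustrative guesses, not the real model config.
PARAMS = 44_000_000        # model size quoted above
BYTES_PER_PARAM = 2        # fp16 weights
SEQ_LEN = 128              # input length in tokens
BATCH_SIZE = 256           # hypothetical batch size; tune to the VRAM actually available
HIDDEN_DIM = 512           # assumed hidden size for a model of this scale
N_LAYERS = 8               # assumed depth

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
# Very rough activation estimate: one fp16 tensor of shape (batch, seq, hidden) per layer.
activations_gb = BATCH_SIZE * SEQ_LEN * HIDDEN_DIM * N_LAYERS * BYTES_PER_PARAM / 1e9

print(f"weights ~{weights_gb:.2f} GB, activations ~{activations_gb:.2f} GB")
# Roughly 0.09 GB of weights plus well under 1 GB of activations, so 8 GB of VRAM is ample.
```

Even allowing generous overheads for the CUDA runtime and framework buffers, the limiting factor is throughput rather than memory.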
Based on user resource requirements we would recommend a p3.2xlarge instance.
Draft PR created to build the GPU node group.
Hi @yznlp @bcrawford-moj, can you please confirm the name of the AMI you have been using in your testing? Thanks
I'm not exactly sure what the AMI name is. Is it what would appear in the dropdown on the control panel? |
Thanks for getting back to me, we will need to investigate further. |
GPU node group created. Tested configuring a pod to access GPU resources, which worked correctly, although this required the taint and label to be removed temporarily. Next step is to resolve the issue with the taint/daemonset interaction.
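For anyone picking this up later, a sketch (using the kubernetes Python client) of the kind of toleration and node selector a GPU workload needs in order to schedule onto the tainted GPU nodes. The taint key, node label and image below are placeholders, not the values used in the actual node group:

```python
# Sketch of a toleration + node selector for a pod targeting the tainted GPU node group.
# The taint key, node label and image are hypothetical placeholders.
from kubernetes import client

gpu_toleration = client.V1Toleration(
    key="nvidia.com/gpu",    # assumed taint key
    operator="Exists",
    effect="NoSchedule",
)

pod_spec = client.V1PodSpec(
    containers=[
        client.V1Container(
            name="gpu-workload",
            image="example/gpu-image:latest",    # placeholder image
            resources=client.V1ResourceRequirements(
                limits={"nvidia.com/gpu": "1"}   # request a single GPU
            ),
        )
    ],
    tolerations=[gpu_toleration],
    node_selector={"compute": "gpu"},            # assumed node group label
)

# Print the spec as it would appear in a manifest, without applying it to any cluster.
print(client.ApiClient().sanitize_for_serialization(pod_spec))
```

The NVIDIA device plugin daemonset needs a matching toleration as well, which is the taint/daemonset interaction referred to above.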
GPU node group and pod creation tested successfully.
VSCode deployed on the GPU-enabled node pool from the control panel dev environment.
Summary: Drivers aren't yet available for Ubuntu 24.04, so we've downgraded to NVIDIA's CUDA base image by cutting a new release from 1.2.0, the last Ubuntu 22.04 release (ministryofjustice/analytical-platform-visual-studio-code#69). This deploys and is able to run Ollama with GPU capability.
Notes: With the current taints/tolerations, only one GPU-enabled workload is schedulable per node, meaning each workload needs its own GPU node. EDIT: GPU sharing may address this: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html
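For reference, a quick way to sanity-check the deployment from inside the container, assuming the default Ollama port and an already-pulled model (the model name below is a placeholder):

```python
# Minimal check that the GPU is visible to the container and that Ollama responds.
# Assumes the default Ollama endpoint; the model name is a placeholder.
import subprocess
import requests

# 1. Is the GPU visible inside the container?
print(subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout)

# 2. Is Ollama reachable and able to generate?
resp = requests.post(
    "http://localhost:11434/api/generate",   # default Ollama endpoint
    json={"model": "llama3", "prompt": "ping", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```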
Follow-on tickets:
Thanks very much for your work on this! Could the following users be given access, please:
@bcrawford-moj we have added the users you provided above; you should now be able to open Visual Studio Code:1.2.0-nvidia-cuda-base (GPU-Enabled) in the control panel. If you or any of the above users have any issues, please let me know. Just to note, there is initially a limit on how many users can use the GPU at a time, so only one or two of your team will be able to deploy the GPU-enabled VSCode on the control panel. We have raised a story to improve on this limitation and will update you as this progresses.
Thanks so much! Very excited to use this. FYI we have a lot of AL over the next few weeks, so I wouldn't expect that limitation to be an issue in the short term.
@AntFMoJ Hi, currently on the AP there are two GPU-enabled VSCode options. Is there a difference between them, and do we need to move our work across to one of them?
Hi @yznlp, commenting on a closed issue is probably not the best way to ask a question; in future, please use the #analytical-platform-support Slack channel. That said, the difference between the releases is solely to do with the pods' idle time, and there should be no need to move your work as it does not affect any file persistence. I would have thought that the retired version would not open, so please use the other version.
Got it thank you :) |
Describe the feature request.
Short term provisioning of an accelerated compute instance.
Describe the context.
In the BOLD programme we are producing a publication on the number of prisoners with children. We have developed a methodology involving LLMs which checks whether prison case notes imply the prisoner has a child. The output will be an Official Statistics in Development report due for publication around the end of May.
We used the AP to run the LLM over the case notes, but it is quite slow (it takes about a week to churn through them). The QA process has highlighted some changes we need to make.
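To illustrate the shape of the workload (the checkpoint, labels and example text below are placeholders, not our actual fine-tuned model or data), it is essentially batched classification of short case-note chunks, which is exactly the kind of job a single GPU accelerates well:

```python
# Illustrative sketch only: batched classification of short case-note chunks on a GPU.
# The checkpoint, labels and example text are placeholders, not the real pipeline or data.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # placeholder checkpoint
    device=0,           # first CUDA GPU; use device=-1 to fall back to CPU
    truncation=True,
    max_length=128,     # inputs are ~128 tokens per chunk
)

case_note_chunks = [
    "Prisoner discussed arrangements for his daughter's school visit.",  # made-up example
    "Routine wing observation, nothing to report.",                      # made-up example
]

# Larger batch sizes amortise GPU overhead; tune to available VRAM.
results = classifier(case_note_chunks, batch_size=64)
for chunk, result in zip(case_note_chunks, results):
    print(result["label"], round(result["score"], 3), "-", chunk[:60])
```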
Value / Purpose
This will allow us to meet our publication deadline.
We believe this will be the first time LLMs have been used in producing official statistics (and indeed one of our models was fine-tuned using generated labeled data, so we think it will also be the first time genAI has been used).
We are happy to have associated costs journaled to the BOLD programme.
User Types
Data scientists
Proposed solution: