Insufficient Slots for MPIJob with 2 Worker Pods and 2 Gaudi Cards Each #680
How many accelerators does your cluster have in total, and how many does each node have? @eero-t Could you help with this?
It's been a couple of years since I used MPI jobs, but I have the same question: are the requested amounts of resources actually available on the nodes? (PVC volume claims, Gaudi devices, CPU cores, memory, hugepages...) PS. I think doing git clones & pip installs on each Job start, instead of just having an image with all of that already prepared (in a local registry), is rather horrible... :-)
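To make that resource question concrete, here is a hypothetical MPIJob Worker pod template fragment (not the reporter's actual manifest; the image name and the CPU/memory/hugepages amounts are placeholders): every request here has to fit within a node's Allocatable for the pod to schedule, and the image is assumed to be prebuilt with the code and Python dependencies so nothing is cloned or pip-installed at Job start.

```yaml
# Hypothetical MPIJob Worker pod template fragment (illustration only, not the
# reporter's manifest; the Launcher and the rest of the MPIJob spec are omitted).
spec:
  containers:
    - name: worker
      # Assumed prebuilt image in a local registry, with the training code and
      # Python dependencies baked in, so nothing is fetched at Job start.
      image: local-registry:5000/gaudi-llama-train:latest
      resources:
        requests:
          habana.ai/gaudi: 2      # must fit within the node's allocatable habana.ai/gaudi
          hugepages-2Mi: 4400Mi   # placeholder amount
          cpu: "16"               # placeholder amount
          memory: 64Gi            # placeholder amount
        limits:
          habana.ai/gaudi: 2      # extended resources require requests == limits
          hugepages-2Mi: 4400Mi   # hugepages also require requests == limits
          cpu: "16"
          memory: 64Gi
```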
Hello @tenzen-y, @eero-t, thanks for replying. We have a cluster with 2 worker nodes, each with 8 Gaudi cards:

Allocatable:
cpu: 159530m
ephemeral-storage: 836227294665
habana.ai/gaudi: 8
hugepages-1Gi: 0
hugepages-2Mi: 35202Mi
memory: 1018724988Ki
pods: 110
---
Allocatable:
cpu: 159530m
ephemeral-storage: 836227294665
habana.ai/gaudi: 8
hugepages-1Gi: 0
hugepages-2Mi: 35202Mi
memory: 1018724920Ki
pods: 110
$ kubectl top nodes
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
ng-nx6n7s4yyi-6f812 4238m 2% 438493Mi 44%
ng-nx6n7s4yyi-d2886 4339m 2% 338358Mi 34%

Other configurations we've tested are without using the …

--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 8
slots that were requested by the application:
python
Either request fewer slots for your application, or make more slots
available for use.

It is overestimating by a factor of 2 (only 4 slots should be available, but 8 is calculated). Trying with …

====================== ALLOCATED NODES ======================
mpijob-worker-0.mpijob.<namespace>.svc: flags=0x13 slots=4 max_slots=0 slots_inuse=0 state=UNKNOWN
mpijob-worker-1.mpijob.<namespace>.svc: flags=0x13 slots=4 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 8
slots that were requested by the application:
python
Either request fewer slots for your application, or make more slots
available for use.

This happens even though there are 8 cards available across the two nodes. Setting 8 cards seems to have a different effect, giving a daemon-related error:

[1,0]<stderr>:[INFO|modeling_utils.py:1622] 2025-02-12 17:46:30,321 >> Instantiating GaudiLlamaForCausalLM model under default dtype torch.bfloat16.
[1,0]<stderr>:[INFO|configuration_utils.py:1099] 2025-02-12 17:46:30,322 >> Generate config GaudiGenerationConfig {
[1,0]<stderr>: "bos_token_id": 1,
[1,0]<stderr>: "eos_token_id": 2,
[1,0]<stderr>: "pad_token_id": 0
[1,0]<stderr>:}
[1,0]<stderr>:
Downloading shards: 100%|██████████| 2/2 [01:30<00:00, 45.40s/it]
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.
HNP daemon : [[9729,0],0] on node mpijob-launcher
Remote daemon: [[9729,0],1] on node mpijob-worker-0.mpijob.<namespace>.svc
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

We want to … and are using the MPI version …. Regarding the usage of a customized image instead of git & pip commands in the job, I agree, it'd be better and we can work on the new image; it's just that so far we've been focusing on enabling the multi-node, multi-card MPIJob. Let me know if there's any more info you want me to provide.
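As a reference point for the slot arithmetic above (a hypothetical sketch, not the configuration from this issue): with the Kubeflow MPI Operator, the slots value written to the generated hostfile for each worker normally comes from the MPIJob's slotsPerWorker field, so the launcher's mpirun -np should not exceed replicas × slotsPerWorker. A minimal sketch for 2 workers with 2 Gaudi cards each, assuming the kubeflow.org/v2beta1 API and placeholder image/script names:

```yaml
# Hypothetical sketch: 2 workers, 2 Gaudi cards each, 4 total slots.
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: mpijob
spec:
  slotsPerWorker: 2            # each worker contributes 2 slots to the hostfile
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: launcher
              image: local-registry:5000/gaudi-llama-train:latest   # placeholder
              command:
                - mpirun
                - -np
                - "4"          # must not exceed 2 workers x 2 slots = 4
                - python
                - train.py     # placeholder script
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: worker
              image: local-registry:5000/gaudi-llama-train:latest   # placeholder
              resources:
                limits:
                  habana.ai/gaudi: 2   # matches slotsPerWorker above
```

With this layout the total is 2 × 2 = 4 slots; if mpirun still reports 8 requested slots, the -np value passed to the launcher (or anything else that multiplies the process count) is the first thing to compare against that total.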
Based on the Multi-Gaudi Workloads Example, I am trying to run an MPIJob with the following configuration:
When I run this configuration, I encounter the following error:
Observations: