Insufficient Slots for MPIJob with 2 Worker Pods and 2 Gaudi Cards Each #680

Open

gera-aldama opened this issue Jan 24, 2025 · 3 comments

@gera-aldama
Based on the Multi-Gaudi Workloads example, I am trying to run an MPIJob with the following configuration:

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: mpijob
spec:
  slotsPerWorker: 2
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          hostIPC: true
          containers:
            - name: mpijob-container
              image: "vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest"
              imagePullPolicy: Always
              command: ["/bin/bash", "-c"]
              args:
                - >-
                  /usr/bin/ssh-keygen -A;
                  /usr/sbin/sshd;

                  HOSTSFILE=$OMPI_MCA_orte_default_hostfile;
                  echo "HOSTSFILE=${HOSTSFILE}";
                  MASTER_ADDR="$(head -n 1 $HOSTSFILE | sed -n s/[[:space:]]slots.*//p)";
                  echo "MASTER_ADDR=${MASTER_ADDR}";
                  NUM_NODES=$(wc -l < $HOSTSFILE);
                  echo "NUM_NODES=${NUM_NODES}";
                  CARDS_PER_NODE=2;
                  N_CARDS=$((NUM_NODES*CARDS_PER_NODE));
                  echo "N_CARDS=${N_CARDS}";

                  SETUP_CMD="git clone --single-branch --branch v1.15.0 https://github.com/huggingface/optimum-habana.git; \
                             pip install -r optimum-habana/examples/language-modeling/requirements.txt; \
                             pip install --no-cache-dir optimum-habana==1.15.0; \
                             pip install --no-cache-dir git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0";

                  eval $SETUP_CMD;

                  mpirun --npernode 1 \
                     --tag-output \
                     --allow-run-as-root \
                     --prefix $MPI_ROOT \
                     -mca routed direct \
                     git clone --single-branch --branch v1.15.0 https://github.com/huggingface/optimum-habana.git;

                  mpirun --npernode 1 \
                     --tag-output \
                     --allow-run-as-root \
                     --prefix $MPI_ROOT \
                     -mca routed direct \
                     pip install -r optimum-habana/examples/language-modeling/requirements.txt;

                  mpirun --npernode 1 \
                     --tag-output \
                     --allow-run-as-root \
                     --prefix $MPI_ROOT \
                     -mca routed direct \
                     pip install --no-cache-dir optimum-habana==1.15.0;

                  mpirun --npernode 1 \
                     --tag-output \
                     --allow-run-as-root \
                     --prefix $MPI_ROOT \
                     -mca routed direct \
                     pip install --no-cache-dir git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0;

                  MODEL_PATH=/optimum-habana/examples/language-modeling;
                  cd $MODEL_PATH;
                  mpirun -np ${N_CARDS} \
                    --allow-run-as-root \
                    --bind-to core \
                    --map-by ppr:4:socket:PE=6 \
                    -rank-by core --report-bindings \
                    --tag-output \
                    --merge-stderr-to-stdout --prefix $MPI_ROOT \
                    -x MASTER_ADDR=$MASTER_ADDR \
                    -mca btl_tcp_if_include eth0 \
                    -mca oob_tcp_if_include eth0 \
                    -mca plm_rsh_no_tree_spawn 1 \
                    python $MODEL_PATH/run_lora_clm.py \
                    --model_name_or_path huggyllama/llama-7b \
                    --dataset_name tatsu-lab/alpaca \
                    --bf16 \
                    --output_dir /tmp/pvc-mount \
                    --num_train_epochs 1 \
                    --per_device_train_batch_size 12 \
                    --evaluation_strategy no \
                    --save_strategy no \
                    --learning_rate 1e-4 \
                    --warmup_ratio 0.03 \
                    --lr_scheduler_type constant \
                    --max_grad_norm 0.3 \
                    --logging_steps 1 \
                    --do_train \
                    --do_eval \
                    --use_habana \
                    --use_lazy_mode \
                    --throughput_warmup_steps 3 \
                    --lora_rank 8 \
                    --lora_alpha 16 \
                    --lora_dropout 0.05 \
                    --lora_target_modules q_proj v_proj \
                    --dataset_concatenation \
                    --max_seq_length 512 \
                    --low_cpu_mem_usage True \
                    --validation_split_percentage 4 \
                    --adam_epsilon 1e-08;
              resources:
                limits:
                  cpu: 16
                  memory: 64Gi
                  hugepages-2Mi: 4400Mi
                requests:
                  cpu: 16
                  memory: 64Gi
                  hugepages-2Mi: 4400Mi
              volumeMounts:
                - name: hf-token
                  mountPath: /tmp/hf_token
                - name: pvc-storage
                  mountPath: /tmp/pvc-mount
          volumes:
            - name: hf-token
              secret:
                secretName: hf-token
            - name: pvc-storage
              persistentVolumeClaim:
                claimName: pvc-storage
    Worker:
      replicas: 2
      template:
        spec:
          hostIPC: true
          containers:
            - name: mpijob-container
              image: "vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest"
              imagePullPolicy: Always
              command: ["/bin/bash", "-c"]
              args:
                - >-
                  /usr/bin/ssh-keygen -A;
                  /usr/sbin/sshd;
                  sleep 365d;
              resources:
                limits:
                  habana.ai/gaudi: 2
                  cpu: 16
                  memory: 64Gi
                  hugepages-2Mi: 4400Mi
                requests:
                  habana.ai/gaudi: 2
                  cpu: 16
                  memory: 64Gi
                  hugepages-2Mi: 4400Mi
              volumeMounts:
                - name: hf-token
                  mountPath: /tmp/hf_token
                - name: pvc-storage
                  mountPath: /tmp/pvc-mount
          volumes:
            - name: hf-token
              secret:
                secretName: hf-token
            - name: pvc-storage
              persistentVolumeClaim:
                claimName: pvc-storage

When I run this configuration, I encounter the following error:

There are not enough slots available in the system to satisfy the 4
slots that were requested by the application:

  python

Either request fewer slots for your application, or make more slots
available for use.

Observations:

  1. The example works fine when using either 1 worker pod with 2 Gaudi cards, or 2 worker pods with 1 Gaudi card each.
  2. Using the --oversubscribe flag results in the following error:
RuntimeError: synStatus=8 [Device not found] Device acquire failed.
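
For reference, the MPI Operator builds the launcher's hostfile from slotsPerWorker and the Worker replica count, so with this spec it should advertise 2 slots on each of the 2 workers (4 in total). A quick sanity check from the launcher pod (a sketch; the pod and namespace names are placeholders):

# print the hostfile the launcher actually sees
kubectl exec -n <namespace> <mpijob-launcher-pod> -- bash -c 'cat "$OMPI_MCA_orte_default_hostfile"'
# expected to look roughly like:
#   mpijob-worker-0.mpijob.<namespace>.svc slots=2
#   mpijob-worker-1.mpijob.<namespace>.svc slots=2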
@tenzen-y (Member)

How many accelerators does your cluster have in total, and how many does each node have?
Additionally, do you want to run this with Open MPI on the Intel accelerators rather than with Intel MPI?

@eero-t Could you help with this?

@eero-t commented Feb 14, 2025

> @eero-t Could you help with this?

It's been a couple of years since I last used MPI jobs, but I have the same question as you: are the requested amounts of resources actually available on the nodes (PVC volume claims, Gaudi devices, CPU cores, memory, hugepages...)?

PS. I think doing git clones & pip installs on each Job start, instead of just having an image with all that already prepared (in local registry), is rather horrible... :-)

@gera-aldama (Author) commented Feb 19, 2025

Hello @tenzen-y, @eero-t, thanks for replying.

We have a cluster with 2 worker nodes, each with 8 Gaudi cards.

Allocatable:
  cpu:                159530m
  ephemeral-storage:  836227294665
  habana.ai/gaudi:    8
  hugepages-1Gi:      0
  hugepages-2Mi:      35202Mi
  memory:             1018724988Ki
  pods:               110
---
Allocatable:
  cpu:                159530m
  ephemeral-storage:  836227294665
  habana.ai/gaudi:    8
  hugepages-1Gi:      0
  hugepages-2Mi:      35202Mi
  memory:             1018724920Ki
  pods:               110

$ kubectl top nodes
NAME                  CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ng-nx6n7s4yyi-6f812   4238m        2%     438493Mi        44%
ng-nx6n7s4yyi-d2886   4339m        2%     338358Mi        34%
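
A per-node check of what is still allocatable versus already allocated (Gaudi cards and hugepages) might also help rule out contention from other pods; a sketch using the node names from the output above:

kubectl describe node ng-nx6n7s4yyi-6f812 | grep -E 'habana.ai/gaudi|hugepages-2Mi'
kubectl describe node ng-nx6n7s4yyi-d2886 | grep -E 'habana.ai/gaudi|hugepages-2Mi'
# compare the "Allocatable" lines with the "Allocated resources" section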

Another configuration we've tested is omitting the -np option from the mpirun command, which supposedly should use all available slots across the worker nodes, but we get the following:

--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 8
slots that were requested by the application:

  python

Either request fewer slots for your application, or make more slots
available for use.

It is overestimating by a factor of 2: only 4 slots should be available, but 8 are calculated.
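
For what it's worth, the detected allocation and planned rank placement can be dumped without launching the training; a sketch run from the launcher container (--display-allocation and --display-map are standard Open MPI 4.x flags, and re-adding the job's own --map-by ppr:4:socket:PE=6 option to this test should show how that mapping policy interacts with the 2 slots per worker):

mpirun -np 4 \
   --allow-run-as-root \
   --display-allocation \
   --display-map \
   --prefix $MPI_ROOT \
   -mca plm_rsh_no_tree_spawn 1 \
   hostname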

Trying with slotsPerWorker: 4, habana.ai/gaudi: 4, and -np, the result is:

======================   ALLOCATED NODES   ======================
        mpijob-worker-0.mpijob.<namespace>.svc: flags=0x13 slots=4 max_slots=0 slots_inuse=0 state=UNKNOWN
        mpijob-worker-1.mpijob.<namespace>.svc: flags=0x13 slots=4 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 8
slots that were requested by the application:

  python

Either request fewer slots for your application, or make more slots
available for use.

This happens even though there are 8 cards available across the two nodes. Setting 8 cards seems to have a different effect, giving a daemon-related error:

[1,0]<stderr>:[INFO|modeling_utils.py:1622] 2025-02-12 17:46:30,321 >> Instantiating GaudiLlamaForCausalLM model under default dtype torch.bfloat16.
[1,0]<stderr>:[INFO|configuration_utils.py:1099] 2025-02-12 17:46:30,322 >> Generate config GaudiGenerationConfig {
[1,0]<stderr>:  "bos_token_id": 1,
[1,0]<stderr>:  "eos_token_id": 2,
[1,0]<stderr>:  "pad_token_id": 0
[1,0]<stderr>:}
[1,0]<stderr>:
Downloading shards: 100%|██████████| 2/2 [01:30<00:00, 45.40s/it]
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[9729,0],0] on node mpijob-launcher
  Remote daemon: [[9729,0],1] on node mpijob-worker-0.mpijob.<namespace>.svc

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

As for MPI, we do want to use Open MPI and are using it: mpirun (Open MPI) 4.1.6. The MPI Operator version is 0.5.0.
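
In case it helps, this is roughly how those versions can be double-checked (a sketch; the namespace and pod names are placeholders):

kubectl exec -n <namespace> <mpijob-launcher-pod> -- mpirun --version
kubectl get deployment -A -o wide | grep mpi-operator   # IMAGES column shows the operator release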

Regarding using a customized image instead of running the git and pip commands in the job, I agree it would be better, and we can work on the new image (a rough sketch is below); it's just that we've been focusing on enabling the multi-node, multi-card MPIJob first.
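
A rough, untested sketch that simply bakes the same setup commands from the job into a derived image; <registry> is a placeholder for a local or private registry:

cat > Dockerfile <<'EOF'
FROM vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest
RUN git clone --single-branch --branch v1.15.0 https://github.com/huggingface/optimum-habana.git && \
    pip install -r optimum-habana/examples/language-modeling/requirements.txt && \
    pip install --no-cache-dir optimum-habana==1.15.0 && \
    pip install --no-cache-dir git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0
EOF
docker build -t <registry>/gaudi-optimum-habana:1.15.0 .
docker push <registry>/gaudi-optimum-habana:1.15.0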

Let me know if there's any more info you want me to provide.
