
feat: Add post processing logic to accelerate launch #346

Closed
willmj wants to merge 31 commits into main from post_process_accelerate

Conversation

@willmj willmj (Collaborator) commented Sep 24, 2024

Description of the change

Add post-processing logic from PR #338 to accelerate launch, with unit tests and documentation.
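
For context, this change wires the LoRA post-processing step from PR #338 into the accelerate launch flow: when lora_post_process_for_vllm is set, the saved adapter checkpoints are rewritten after training completes so vLLM can serve adapters whose tuning added new tokens. Below is a minimal sketch of that wiring; the helper name, import path, and signature are assumptions based on PR #338, not the exact implementation.

import os

# Assumed helper from PR #338; the real function may live elsewhere and take
# different arguments.
from tuning.utils.merge_model_utils import post_process_vLLM_adapters_new_tokens


def post_process_checkpoints_for_vllm(job_config: dict, output_dir: str) -> None:
    """Rewrite saved LoRA checkpoints so vLLM can load adapters with added tokens."""
    if not job_config.get("lora_post_process_for_vllm"):
        return
    if job_config.get("peft_method") != "lora":
        return  # post-processing only applies to LoRA adapters

    for entry in os.listdir(output_dir):
        checkpoint_dir = os.path.join(output_dir, entry)
        if entry.startswith("checkpoint-") and os.path.isdir(checkpoint_dir):
            # Copy the embeddings for any newly added tokens into the adapter
            # artifacts (rewritten in place here).
            post_process_vLLM_adapters_new_tokens(
                path_to_checkpoint=checkpoint_dir,
                modified_checkpoint_path=checkpoint_dir,
            )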

Related issue number

How to verify the PR

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

Ssukriti and others added 28 commits September 10, 2024 17:16
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Signed-off-by: Angel Luu <angel.luu@us.ibm.com>
Signed-off-by: Angel Luu <angel.luu@us.ibm.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
…ss_LoRA

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
* get num_added_tokens

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* remove extra code

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

---------

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Signed-off-by: Angel Luu <angel.luu@us.ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
* refactor saving tokens metadata

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* remove extra check

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* post processing script

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* post processing script

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* fix: unit test args

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* undo post_process_vLLm flag

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

---------

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
github-actions bot commented:

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@github-actions github-actions bot added the feat label Sep 24, 2024
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
@willmj willmj (Collaborator, Author) commented Sep 24, 2024

It works for llama3-8b!

Tuning and inference details

Tuning config:

apiVersion: v1
kind: ConfigMap
metadata:
  name: sft-trainer-config-will-l38b
data:
  config.json: |
      {
          "model_name_or_path": "/llama3_eval/hf/8b_pre_trained",
          "training_data_path": "/testing/tuning/input/twitter-complaints.json",
          "output_dir": "/testing/tuning/output/llama3-8b_pre_trained/lora/${NOW}",
          "save_model_dir": "/testing/tuning/output/llama3-8b_pre_trained/lora/${NOW}-save_model",
          "num_train_epochs": 10.0,
          "per_device_train_batch_size": 2,
          "gradient_accumulation_steps": 1,
          "learning_rate": 1e-4,
          "response_template": "\n### Label:",
          "dataset_text_field": "output",
          "gradient_checkpointing": true,
          "peft_method": "lora",
          "r": 8,
          "lora_alpha": 16,
          "lora_dropout": 0.05,
          "target_modules": ["all-linear"],
          "lora_post_process_for_vllm": true
      }
---
apiVersion: v1
kind: Pod
metadata:
  name: will-sft-trainer-llama3-8b
spec:
  securityContext:
    runAsUser: 0
    runAsGroup: 0
    fsGroup: 1000
    fsGroupChangePolicy: "OnRootMismatch"
  containers:
    - env:
        - name: SFT_TRAINER_CONFIG_JSON_PATH
          value: /config/config.json
      image: docker-na-public.artifactory.swg-devops.com/wcp-ai-foundation-team-docker-virtual/sft-trainer:anhdev_ubi9_py311
      imagePullPolicy: IfNotPresent
      command: [ "/bin/bash", "-c", "--" ]
      args: [ "while true; do sleep 30; done;" ]
      name: train-conductor-training
      resources:
        limits:
          nvidia.com/gpu: "2"
          memory: 200Gi
          cpu: "10"
          ephemeral-storage: 2Ti
        requests:
          nvidia.com/gpu: "2"
          memory: 80Gi
          cpu: "5"
          ephemeral-storage: 1600Gi
      volumeMounts:
        - mountPath: /testing
          name: testing-bucket
        - mountPath: /llama3_eval
          name: llama-3-pvc
          readOnly: true
        - mountPath: /granite
          name: granite-pvc
          readOnly: true
        - mountPath: /config
          name: sft-trainer-config 
        - mountPath: /data
          name: input-data
  imagePullSecrets:
    - name: artifactory-docker-anh
  restartPolicy: Never
  terminationGracePeriodSeconds: 30
  volumes:
    - name: testing-bucket
      persistentVolumeClaim:
         claimName: fmaas-integration-tests-pvc
    - name: granite-pvc
      persistentVolumeClaim:
         claimName: granite-pvc
    - name: llama-3-pvc
      persistentVolumeClaim:
         claimName: llama-3-pvc
    - name: input-data
      persistentVolumeClaim:
         claimName: fms-tuning-pvc
    - name: sft-trainer-config
      configMap:
         name: sft-trainer-config-will-l38b
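
After the tuning job completes, one quick sanity check that post-processing ran is to look for the new-token embedding file next to the adapter weights before pointing the inference server at the checkpoint. A minimal sketch, assuming the post-processed layout described in PR #338 (the file name new_embeddings.safetensors is an assumption):

import os

# Hypothetical check on a post-processed LoRA checkpoint; adjust the path to
# the save_model_dir produced by the tuning run above.
checkpoint_dir = "/testing/tuning/output/llama3-8b_pre_trained/lora/20240924154416-save_model"

expected_files = [
    "adapter_config.json",         # standard PEFT adapter config
    "adapter_model.safetensors",   # LoRA weights
    "new_embeddings.safetensors",  # embeddings of tokens added during tuning (assumed post-processing output)
]

for fname in expected_files:
    path = os.path.join(checkpoint_dir, fname)
    print(f"{fname}: {'found' if os.path.exists(path) else 'missing'}")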

Inference config:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: text-gen-llama3-8b-pre-trained
  name: llama3-8b-pre-trained-inference-server
spec:
  clusterIP: None
  ports:
  - name: grpc
    port: 8033
    targetPort: grpc
  selector:
    app: text-gen-llama3-8b-pre-trained
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: text-gen-llama3-8b-pre-trained
    component: fmaas-inference-server
  name: llama3-8b-pre-trained-inference-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: text-gen-llama3-8b-pre-trained
      component: fmaas-inference-server
  strategy:
    rollingUpdate:
      maxSurge: 1
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics/
        prometheus.io/port: "3000"
        prometheus.io/scrape: "true"
      labels:
        app: text-gen-llama3-8b-pre-trained
        component: fmaas-inference-server
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values:
                - NVIDIA-A100-SXM4-80GB
      containers:
      - env:
        - name: MODEL_NAME
          value: "/llama3/hf/8b_pre_trained/"
        - name: OUTPUT_SPECIAL_TOKENS
          value: "true"
        - name: MAX_NEW_TOKENS
          value: "4096"
        - name: DEPLOYMENT_FRAMEWORK
          value: tgis_native
        - name: FLASH_ATTENTION
          value: "true"
        - name: NUM_GPUS
          value: "1"
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        - name: PORT
          value: "3000"
        - name: MAX_LOG_LEN
          value: "100"
        - name: ENABLE_LORA
          value: "true"
        - name: ADAPTER_CACHE
          value: "/testing/tuning/output/llama3-8b_pre_trained/lora/"
        # had to update from shared_model_storage so was writeable for model from HF
        - name: HF_HUB_CACHE
          value: /tmp
        - name: TRANSFORMERS_CACHE
          value: $(HF_HUB_CACHE)
        # The below values may vary by model, this is taken for granite-13b
        - name: MAX_BATCH_SIZE
          value: "256"
        - name: MAX_CONCURRENT_REQUESTS
          value: "256"
        # Below is added for granite-3b-code-instruct model
        # - name: VLLM_ATTENTION_BACKEND
        #   value: XFORMERS
        # to download model from HF add below
        # - name: HF_HUB_OFFLINE
        #   value: "0" 
        image: quay.io/opendatahub/vllm:fast-ibm-881aaba  # released Sep 4, 2024
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          periodSeconds: 100
          successThreshold: 1
          timeoutSeconds: 8
        name: server
        ports:
        - containerPort: 3000
          name: http
          protocol: TCP
        - containerPort: 8033
          name: grpc
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 5
        # resources will vary by model -- taken for granite-13b
        resources:
          limits:
            cpu: "8"
            memory: 48Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: "4"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          privileged: false
          runAsNonRoot: true
          seccompProfile:
            type: RuntimeDefault
        startupProbe:
          failureThreshold: 24
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          periodSeconds: 30
          successThreshold: 1
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /granite
          name: ibm-granite-pvc
          readOnly: true
        - name: llama-eval-pvc
          mountPath: /llama
          readOnly: true
        - name: llama-3-pvc
          mountPath: /llama3
          readOnly: true
        # - mountPath: /data
        #   name: fms-tuning
        #   readOnly: true
        - mountPath: /testing
          name: fmaas-integration-tests
          readOnly: true
      dnsPolicy: ClusterFirst
      enableServiceLinks: false
      # imagePullSecrets:
      # - name: artifactory-docker
      volumes:
      - name: ibm-granite-pvc
        persistentVolumeClaim:
          claimName: ibm-granite-pvc
      - name: llama-eval-pvc
        persistentVolumeClaim:
          claimName: llama-eval-pvc
      - name: llama-3-pvc
        persistentVolumeClaim:
          claimName: llama-3-pvc
      # - name: fms-tuning
      #   persistentVolumeClaim:
      #     claimName: fms-tuning-pvc
      - name: fmaas-integration-tests
        persistentVolumeClaim:
          claimName: fmaas-integration-tests
 grpcurl -plaintext -proto ./proto/generation.proto -d "{\"adapter_id\": \"20240924154416-save_model\",\"params\":{\"method\":\"GREEDY\", \"stopping\": {\"max_new_tokens\": 128}}, \"requests\": [{\"text\":\"### Text: @sho_help @showtime your arrive is terrible streaming is stop and start every couple mins. Get it together it's xmas\n\n### Label:\"}, {\"text\":\"### Text: @FitbitSupport when are you launching new clock faces for Indian market\n\n### Label:\"}]}" localhost:8033 fmaas.GenerationService/Generate
{
  "responses": [
    {
      "generatedTokenCount": 2,
      "text": " complaint\u003c|end_of_text|\u003e",
      "inputTokenCount": 34,
      "stopReason": "EOS_TOKEN",
      "stopSequence": "\u003c|end_of_text|\u003e"
    },
    {
      "generatedTokenCount": 3,
      "text": " no complaint\u003c|end_of_text|\u003e",
      "inputTokenCount": 21,
      "stopReason": "EOS_TOKEN",
      "stopSequence": "\u003c|end_of_text|\u003e"
    }
  ]
}
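
The request body in the grpcurl call above is hard to edit inline. A small helper like the sketch below (added here for convenience, not part of this PR) builds the same payload and prints it, so it can be piped to grpcurl with -d @ to read the request from stdin.

import json

# Build the same Generate request used above: adapter_id points at the
# post-processed save_model directory under the ADAPTER_CACHE mount.
request = {
    "adapter_id": "20240924154416-save_model",
    "params": {"method": "GREEDY", "stopping": {"max_new_tokens": 128}},
    "requests": [
        {"text": "### Text: @sho_help @showtime your arrive is terrible streaming is stop and start every couple mins. Get it together it's xmas\n\n### Label:"},
        {"text": "### Text: @FitbitSupport when are you launching new clock faces for Indian market\n\n### Label:"},
    ],
}

# Example usage (file name is illustrative):
#   python build_request.py | grpcurl -plaintext -proto ./proto/generation.proto \
#     -d @ localhost:8033 fmaas.GenerationService/Generate
print(json.dumps(request))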

@willmj willmj (Collaborator, Author) commented Sep 24, 2024

[screenshot attached]

@anhuong anhuong (Collaborator) left a comment

Small comments on fixes; Will is currently testing. Otherwise LGTM.

build/accelerate_launch.py (review comment, outdated, resolved)
build/accelerate_launch.py (review comment, outdated, resolved)
build/accelerate_launch.py (review comment, outdated, resolved)
@@ -38,14 +38,16 @@ For example, the below config is used for running with two GPUs and FSDP for fin
     "per_device_train_batch_size": 4,
     "learning_rate": 1e-5,
     "response_template": "\n### Label:",
-    "dataset_text_field": "output"
+    "dataset_text_field": "output",
+    "lora_post_process_for_vllm": true
Collaborator:

Would be interested to hear if @Ssukriti has thoughts on this param but looks good to me and thanks for updating these docs!

Collaborator:

ya sounds good

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
anhuong previously approved these changes Sep 25, 2024

@anhuong anhuong (Collaborator) left a comment

LGTM

@anhuong anhuong changed the base branch from utility_to_post-process_LoRA to main September 25, 2024 21:17
@anhuong anhuong dismissed their stale review September 25, 2024 21:17

The base branch was changed.

@anhuong anhuong changed the base branch from main to utility_to_post-process_LoRA September 25, 2024 21:18
Signed-off-by: Anh Uong <anh.uong@ibm.com>
@anhuong anhuong changed the base branch from utility_to_post-process_LoRA to main September 25, 2024 21:41
@anhuong anhuong (Collaborator) commented Sep 25, 2024

Adding changes from main didn't resolve the extra commits here, although the files changed are correct. Rebasing was taking a long time and was hard to validate, so we moved over to a new PR #351.

@anhuong anhuong closed this Sep 25, 2024
@willmj willmj deleted the post_process_accelerate branch September 25, 2024 22:00