
feat: Add post processing logic to accelerate launch #346

Closed
willmj wants to merge 31 commits into main from post_process_accelerate

Conversation

@willmj willmj (Collaborator) commented Sep 24, 2024

Description of the change

Add post-processing logic from PR #338 to accelerate launch, with unit tests and documentation.
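
For context, this change wires the LoRA post-processing step from PR #338 into the accelerate launch flow: when lora_post_process_for_vllm is set, the saved adapter checkpoints are rewritten after training completes so vLLM can serve adapters whose tuning added new tokens. Below is a minimal sketch of that wiring; the helper name, import path, and signature are assumptions based on PR #338, not the exact implementation.

import os

# Assumed helper from PR #338; the real function may live elsewhere and take
# different arguments.
from tuning.utils.merge_model_utils import post_process_vLLM_adapters_new_tokens


def post_process_checkpoints_for_vllm(job_config: dict, output_dir: str) -> None:
    """Rewrite saved LoRA checkpoints so vLLM can load adapters with added tokens."""
    if not job_config.get("lora_post_process_for_vllm"):
        return
    if job_config.get("peft_method") != "lora":
        return  # post-processing only applies to LoRA adapters

    for entry in os.listdir(output_dir):
        checkpoint_dir = os.path.join(output_dir, entry)
        if entry.startswith("checkpoint-") and os.path.isdir(checkpoint_dir):
            # Copy the embeddings for any newly added tokens into the adapter
            # artifacts (rewritten in place here).
            post_process_vLLM_adapters_new_tokens(
                path_to_checkpoint=checkpoint_dir,
                modified_checkpoint_path=checkpoint_dir,
            )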

Related issue number

How to verify the PR

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

Ssukriti and others added 28 commits September 10, 2024 17:16
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Signed-off-by: Angel Luu <angel.luu@us.ibm.com>
Signed-off-by: Angel Luu <angel.luu@us.ibm.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
…ss_LoRA

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
* get num_added_tokens

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* remove extra code

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

---------

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Signed-off-by: Angel Luu <angel.luu@us.ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
* refactor saving tokens metadata

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* remove extra check

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* post processing script

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* post processing script

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* fix: unit test args

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

* undo post_process_vLLm flag

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>

---------

Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Signed-off-by: Sukriti-Sharma4 <sukriti.sharma4@ibm.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
github-actions bot commented:

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@github-actions github-actions bot added the feat label Sep 24, 2024
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
@willmj willmj (Collaborator, Author) commented Sep 24, 2024

It works for llama3-8b!

Tuning and inference details

Tuning config:

apiVersion: v1
kind: ConfigMap
metadata:
  name: sft-trainer-config-will-l38b
data:
  config.json: |
      {
          "model_name_or_path": "/llama3_eval/hf/8b_pre_trained",
          "training_data_path": "/testing/tuning/input/twitter-complaints.json",
          "output_dir": "/testing/tuning/output/llama3-8b_pre_trained/lora/${NOW}",
          "save_model_dir": "/testing/tuning/output/llama3-8b_pre_trained/lora/${NOW}-save_model",
          "num_train_epochs": 10.0,
          "per_device_train_batch_size": 2,
          "gradient_accumulation_steps": 1,
          "learning_rate": 1e-4,
          "response_template": "\n### Label:",
          "dataset_text_field": "output",
          "gradient_checkpointing": true,
          "peft_method": "lora",
          "r": 8,
          "lora_alpha": 16,
          "lora_dropout": 0.05,
          "target_modules": ["all-linear"],
          "lora_post_process_for_vllm": true
      }
---
apiVersion: v1
kind: Pod
metadata:
  name: will-sft-trainer-llama3-8b
spec:
  securityContext:
    runAsUser: 0
    runAsGroup: 0
    fsGroup: 1000
    fsGroupChangePolicy: "OnRootMismatch"
  containers:
    - env:
        - name: SFT_TRAINER_CONFIG_JSON_PATH
          value: /config/config.json
      image: docker-na-public.artifactory.swg-devops.com/wcp-ai-foundation-team-docker-virtual/sft-trainer:anhdev_ubi9_py311
      imagePullPolicy: IfNotPresent
      command: [ "/bin/bash", "-c", "--" ]
      args: [ "while true; do sleep 30; done;" ]
      name: train-conductor-training
      resources:
        limits:
          nvidia.com/gpu: "2"
          memory: 200Gi
          cpu: "10"
          ephemeral-storage: 2Ti
        requests:
          nvidia.com/gpu: "2"
          memory: 80Gi
          cpu: "5"
          ephemeral-storage: 1600Gi
      volumeMounts:
        - mountPath: /testing
          name: testing-bucket
        - mountPath: /llama3_eval
          name: llama-3-pvc
          readOnly: true
        - mountPath: /granite
          name: granite-pvc
          readOnly: true
        - mountPath: /config
          name: sft-trainer-config 
        - mountPath: /data
          name: input-data
  imagePullSecrets:
    - name: artifactory-docker-anh
  restartPolicy: Never
  terminationGracePeriodSeconds: 30
  volumes:
    - name: testing-bucket
      persistentVolumeClaim:
         claimName: fmaas-integration-tests-pvc
    - name: granite-pvc
      persistentVolumeClaim:
         claimName: granite-pvc
    - name: llama-3-pvc
      persistentVolumeClaim:
         claimName: llama-3-pvc
    - name: input-data
      persistentVolumeClaim:
         claimName: fms-tuning-pvc
    - name: sft-trainer-config
      configMap:
         name: sft-trainer-config-will-l38b
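
After the tuning job completes, one quick sanity check that post-processing ran is to look for the new-token embedding file next to the adapter weights before pointing the inference server at the checkpoint. A minimal sketch, assuming the post-processed layout described in PR #338 (the file name new_embeddings.safetensors is an assumption):

import os

# Hypothetical check on a post-processed LoRA checkpoint; adjust the path to
# the save_model_dir produced by the tuning run above.
checkpoint_dir = "/testing/tuning/output/llama3-8b_pre_trained/lora/20240924154416-save_model"

expected_files = [
    "adapter_config.json",         # standard PEFT adapter config
    "adapter_model.safetensors",   # LoRA weights
    "new_embeddings.safetensors",  # embeddings of tokens added during tuning (assumed post-processing output)
]

for fname in expected_files:
    path = os.path.join(checkpoint_dir, fname)
    print(f"{fname}: {'found' if os.path.exists(path) else 'missing'}")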

Inference config:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: text-gen-llama3-8b-pre-trained
  name: llama3-8b-pre-trained-inference-server
spec:
  clusterIP: None
  ports:
  - name: grpc
    port: 8033
    targetPort: grpc
  selector:
    app: text-gen-llama3-8b-pre-trained
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: text-gen-llama3-8b-pre-trained
    component: fmaas-inference-server
  name: llama3-8b-pre-trained-inference-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: text-gen-llama3-8b-pre-trained
      component: fmaas-inference-server
  strategy:
    rollingUpdate:
      maxSurge: 1
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics/
        prometheus.io/port: "3000"
        prometheus.io/scrape: "true"
      labels:
        app: text-gen-llama3-8b-pre-trained
        component: fmaas-inference-server
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values:
                - NVIDIA-A100-SXM4-80GB
      containers:
      - env:
        - name: MODEL_NAME
          value: "/llama3/hf/8b_pre_trained/"
        - name: OUTPUT_SPECIAL_TOKENS
          value: "true"
        - name: MAX_NEW_TOKENS
          value: "4096"
        - name: DEPLOYMENT_FRAMEWORK
          value: tgis_native
        - name: FLASH_ATTENTION
          value: "true"
        - name: NUM_GPUS
          value: "1"
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        - name: PORT
          value: "3000"
        - name: MAX_LOG_LEN
          value: "100"
        - name: ENABLE_LORA
          value: "true"
        - name: ADAPTER_CACHE
          value: "/testing/tuning/output/llama3-8b_pre_trained/lora/"
        # had to update from shared_model_storage so was writeable for model from HF
        - name: HF_HUB_CACHE
          value: /tmp
        - name: TRANSFORMERS_CACHE
          value: $(HF_HUB_CACHE)
        # The below values may vary by model, this is taken for granite-13b
        - name: MAX_BATCH_SIZE
          value: "256"
        - name: MAX_CONCURRENT_REQUESTS
          value: "256"
        # Below is added for granite-3b-code-instruct model
        # - name: VLLM_ATTENTION_BACKEND
        #   value: XFORMERS
        # to download model from HF add below
        # - name: HF_HUB_OFFLINE
        #   value: "0" 
        image: quay.io/opendatahub/vllm:fast-ibm-881aaba  # released Sep 4, 2024
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          periodSeconds: 100
          successThreshold: 1
          timeoutSeconds: 8
        name: server
        ports:
        - containerPort: 3000
          name: http
          protocol: TCP
        - containerPort: 8033
          name: grpc
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 5
        # resources will vary by model -- taken for granite-13b
        resources:
          limits:
            cpu: "8"
            memory: 48Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: "4"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          privileged: false
          runAsNonRoot: true
          seccompProfile:
            type: RuntimeDefault
        startupProbe:
          failureThreshold: 24
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          periodSeconds: 30
          successThreshold: 1
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /granite
          name: ibm-granite-pvc
          readOnly: true
        - name: llama-eval-pvc
          mountPath: /llama
          readOnly: true
        - name: llama-3-pvc
          mountPath: /llama3
          readOnly: true
        # - mountPath: /data
        #   name: fms-tuning
        #   readOnly: true
        - mountPath: /testing
          name: fmaas-integration-tests
          readOnly: true
      dnsPolicy: ClusterFirst
      enableServiceLinks: false
      # imagePullSecrets:
      # - name: artifactory-docker
      volumes:
      - name: ibm-granite-pvc
        persistentVolumeClaim:
          claimName: ibm-granite-pvc
      - name: llama-eval-pvc
        persistentVolumeClaim:
          claimName: llama-eval-pvc
      - name: llama-3-pvc
        persistentVolumeClaim:
          claimName: llama-3-pvc
      # - name: fms-tuning
      #   persistentVolumeClaim:
      #     claimName: fms-tuning-pvc
      - name: fmaas-integration-tests
        persistentVolumeClaim:
          claimName: fmaas-integration-tests
 grpcurl -plaintext -proto ./proto/generation.proto -d "{\"adapter_id\": \"20240924154416-save_model\",\"params\":{\"method\":\"GREEDY\", \"stopping\": {\"max_new_tokens\": 128}}, \"requests\": [{\"text\":\"### Text: @sho_help @showtime your arrive is terrible streaming is stop and start every couple mins. Get it together it's xmas\n\n### Label:\"}, {\"text\":\"### Text: @FitbitSupport when are you launching new clock faces for Indian market\n\n### Label:\"}]}" localhost:8033 fmaas.GenerationService/Generate
{
  "responses": [
    {
      "generatedTokenCount": 2,
      "text": " complaint\u003c|end_of_text|\u003e",
      "inputTokenCount": 34,
      "stopReason": "EOS_TOKEN",
      "stopSequence": "\u003c|end_of_text|\u003e"
    },
    {
      "generatedTokenCount": 3,
      "text": " no complaint\u003c|end_of_text|\u003e",
      "inputTokenCount": 21,
      "stopReason": "EOS_TOKEN",
      "stopSequence": "\u003c|end_of_text|\u003e"
    }
  ]
}
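
The request body in the grpcurl call above is hard to edit inline. A small helper like the sketch below (added here for convenience, not part of this PR) builds the same payload and prints it, so it can be piped to grpcurl with -d @ to read the request from stdin.

import json

# Build the same Generate request used above: adapter_id points at the
# post-processed save_model directory under the ADAPTER_CACHE mount.
request = {
    "adapter_id": "20240924154416-save_model",
    "params": {"method": "GREEDY", "stopping": {"max_new_tokens": 128}},
    "requests": [
        {"text": "### Text: @sho_help @showtime your arrive is terrible streaming is stop and start every couple mins. Get it together it's xmas\n\n### Label:"},
        {"text": "### Text: @FitbitSupport when are you launching new clock faces for Indian market\n\n### Label:"},
    ],
}

# Example usage (file name is illustrative):
#   python build_request.py | grpcurl -plaintext -proto ./proto/generation.proto \
#     -d @ localhost:8033 fmaas.GenerationService/Generate
print(json.dumps(request))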

@willmj willmj (Collaborator, Author) commented Sep 24, 2024

[screenshot attached]

@anhuong anhuong (Collaborator) left a comment

Small comments on fixes; Will is currently testing. Otherwise LGTM.

build/accelerate_launch.py (review comment, outdated, resolved)
build/accelerate_launch.py (review comment, outdated, resolved)
build/accelerate_launch.py (review comment, outdated, resolved)
@@ -38,14 +38,16 @@ For example, the below config is used for running with two GPUs and FSDP for fin
     "per_device_train_batch_size": 4,
     "learning_rate": 1e-5,
     "response_template": "\n### Label:",
-    "dataset_text_field": "output"
+    "dataset_text_field": "output",
+    "lora_post_process_for_vllm": true
Collaborator:

Would be interested to hear if @Ssukriti has thoughts on this param but looks good to me and thanks for updating these docs!

Collaborator:

ya sounds good

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
anhuong previously approved these changes Sep 25, 2024

@anhuong anhuong (Collaborator) left a comment

LGTM

@anhuong anhuong changed the base branch from utility_to_post-process_LoRA to main September 25, 2024 21:17
@anhuong anhuong dismissed their stale review September 25, 2024 21:17

The base branch was changed.

@anhuong anhuong changed the base branch from main to utility_to_post-process_LoRA September 25, 2024 21:18
Signed-off-by: Anh Uong <anh.uong@ibm.com>
@anhuong anhuong changed the base branch from utility_to_post-process_LoRA to main September 25, 2024 21:41
@anhuong anhuong (Collaborator) commented Sep 25, 2024

Adding changes from main didn't resolve the extra commits here, although the files changed are correct. Rebasing was taking a long time and was hard to validate, so we moved over to a new PR #351.

@anhuong anhuong closed this Sep 25, 2024
@willmj willmj deleted the post_process_accelerate branch September 25, 2024 22:00