The standalone.py
script simulates the InstructLab workflow within a Kubernetes environment,
replicating the functionality of a KubeFlow Pipeline. This allows for distributed training and evaluation
of models without relying on centralized orchestration tools like KubeFlow.
The standalone.py
tool provides support for fetching generated SDG (Synthetic Data Generation) data from an AWS S3 compatible object store.
While AWS S3 is supported, alternative object storage solutions such as Ceph, Nooba, and MinIO are also compatible.
The standalone.py
script is designed to run within a Kubernetes environment. The following requirements must be met:
- A Kubernetes cluster with the necessary resources to run the InstructLab workflow.
- Nodes with GPUs available for training.
- KubeFlow training operator must be running and
PyTorchJob
CRD must be installed. - A StorageClass that supports dynamic provisioning with ReadWriteMany access mode.
- A Kubernetes configuration that allows access to the Kubernetes cluster.
- Both cluster and in-cluster configurations are supported.
- SDG data generated and uploaded to an object store.
Note
The script can be run outside of the cluster from the command-line, but it requires that the user is currently logged into a Kubernetes cluster.
Tip
Check the show
command to display an example of a Kubernetes Job that runs the script. Run ./standalone.py show
.
- Run any part of the InstructLab workflow in a standalone environment independently or a full end-to-end workflow:
- Fetch SDG data, model, and taxonomy from an object store with
sdg-data-fetch
subcommand.- Support for AWS S3 and compatible object storage solutions.
- Configure S3 details via CLI options, environment variables, or Kubernetes secret.
- Train model with
train
subcommand. - Evaluate model by running MT_Bench with
evaluation
subcommand along with--eval-type mt-bench
option. - Final model evaluation with
evaluation
subcommand along with--eval-type final
option.- Final evaluation runs both MT Bench_Branch and MMLU_Branch
- Push the final model back to the object store - same location as the SDG data with
upload-trained-model
subcommand.
- Fetch SDG data, model, and taxonomy from an object store with
Note
Read about InstructLab model evaluation in the instructlab/eval repository.
The standalone.py
script includes a main command to execute the full workflow, along with
subcommands to run individual parts of the workflow separately. The full workflow includes fetching SDG data from S3, training,
and evaluating a model. To view all available commands, use ./standalone.py --help
.
The script requires information regarding the location and method for accessing the SDG data/model/taxonomy tree and the evaluation Judge model serving endpoint. This information can be provided in two main ways:
- CLI Options or/and Environment Variables: Supply all necessary information via CLI options or environment variables.
- See CLI Options for full list. In particular
--sdg-object-store-*
and--judge-serving-model-*
options.
- See CLI Options for full list. In particular
- Kubernetes Secret: Provide the name of a Kubernetes secret that contains all relevant details
using the
--sdg-object-store-secret
option and--judge-serving-model-secret
option.- See Creating the Kubernetes Secret for S3 Details for information on how to create the secret.
The examples below assume there is a secret in my-namespace
named sdg-data
that holds
information about the S3 bucket and judge-serving-details
secret that includes information about
the judge server model. A judge server model is assumed to run external to the script. See Judge
Model Details for required information.
./standalone.py run \
--namespace my-namespace \
--judge-serving-model-secret judge-serving-details \
--sdg-object-store-secret sdg-data
Now let's say you only want to fetch the SDG data, you can use the sdg-data-fetch
subcommand:
./standalone.py run sdg-data-fetch \
--namespace my-namespace \
--judge-serving-model-secret judge-serving-details \
--sdg-object-store-secret sdg-data
Other subcommands are available to run the training and evaluation steps:
train
evaluation
Caution
CLI options MUST be positioned AFTER the run
command and BEFORE any subcommands.
--namespace
: The namespace in which the Kubernetes resources are located - Required--storage-class
: The storage class to use for the PVCs - Optional - Default: cluster default storage class.--nproc-per-node
: Number of GPU to use per node - for training only - Optional - Default: 1.--sdg-object-store-secret
: The name of the Kubernetes secret containing the SDG object store credentials. Optional - If not provided, the script will expect the provided CLI options to fetch the SDG data.--sdg-object-store-endpoint
: The endpoint of the object store.SDG_OBJECT_STORE_ENDPOINT
environment variable can be used as well. Optional--sdg-object-store-bucket
: The bucket name in the object store.SDG_OBJECT_STORE_BUCKET
environment variable can be used as well. Required - If--sdg-object-store-secret
is not provided.--sdg-object-store-access-key
: The access key for the object store.SDG_OBJECT_STORE_ACCESS_KEY
environment variable can be used as well. Required - If--sdg-object-store-secret
is not provided.--sdg-object-store-secret-key
: The secret key for the object store.SDG_OBJECT_STORE_SECRET_KEY
environment variable can be used as well. Required - If--sdg-object-store-secret
is not provided.--sdg-object-store-data-key
: The key for the SDG data in the object store. e.g.,data.tar.gz
.SDG_OBJECT_STORE_DATA_KEY
environment variable can be used as well. Required - If--sdg-object-store-secret
is not provided.--sdg-object-store-verify-tls
: Whether to verify TLS for the object store endpoint (default: true).SDG_OBJECT_STORE_VERIFY_TLS
environment variable can be used as well. Optional--sdg-object-store-region
: The region of the object store.SDG_OBJECT_STORE_REGION
environment variable can be used as well. Optional--judge-serving-model-endpoint
: Serving endpoint for evaluation. e.g: http://serving.kubeflow.svc.cluster.local:8080/v1 - Optional--judge-serving-model-name
: The name of the model to use for evaluation. Optional--judge-serving-model-api-key
: The API key for the model to evaluate.JUDGE_SERVING_MODEL_API_KEY
environment variable can be used as well. Optional--judge-serving-model-secret
: The name of the Kubernetes secret containing the judge serving model API key. Optional - If not provided, the script will expect the provided CLI options to evaluate the model.--force-pull
: Force pull the data (sdg data, model and taxonomy) from the object store even if it already exists in the PVC. Optional - Default: false.--training-1-epoch-num
: The number of epochs to train the model for phase 1. Optional - Default: 7.--training-2-epoch-num
: The number of epochs to train the model for phase 2. Optional - Default: 10.--eval-type
: The evaluation type to use. Optional - Default:mt-bench
. Available options:mt-bench
,final
.
The following example demonstrates how to generate SDG data, package it as a tarball, and upload it
to an object store. This assumes that AWS CLI is installed and configured with the necessary
credentials.
In this scenario the name of the bucket is sdg-data
and the tarball file is data.tar.gz
.
ilab data generate
mv generated data
tar -czvf data.tar.gz data model taxonomy
aws cp data.tar.gz s3://sdg-data/data.tar.gz
Caution
SDG data must exist in a directory called "data".
The model to train must exist in a directory called "model".
The taxonomy tree used to generate the SDG data must exist in a directory called "taxonomy".
The tarball must contain three top-level directories: data
, model
and taxonomy
.
Caution
The tarball format must be .tar.gz
.
Alternatively, you can use the standalone/sdg-data-on-s3.py script to upload the SDG data to the object store.
./sdg-data-on-s3.py upload \
--object-store-bucket sdg-data \
--object-store-access-key $ACCESS_KEY \
--object-store-secret-key $SECRET_KEY \
--sdg-data-archive-file-path data.tar.gz
Run ./sdg-data-on-s3.py upload --help
to see all available options.
The simplest method to supply the script with the required information for retrieving SDG data is by
creating a Kubernetes secret. In the example below, we create a secret called sdg-data
within the
my-namespace
namespace, containing the necessary credentials. Ensure that you update the access
key and secret key as needed. The data_key
field refers to the name of the tarball file in the
object store that holds the SDG data. In this case, it's named data.tar.gz
, as we previously
uploaded the tarball to the object store using this name.
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Secret
metadata:
name: sdg-data
namespace: my-namespace
type: Opaque
stringData:
bucket: sdg-data
access_key: *****
secret_key: *****
data_key: data.tar.gz
EOF
Warning
The secret must be part of the same namespace as the resources that the script interacts with.
It's inherented from the --namespace
option.
The list of all supported keys:
bucket
: The bucket name in the object store - Requiredaccess_key
: The access key for the object store - Requiredsecret_key
: The secret key for the object store - Requireddata_key
: The key for the SDG data in the object store - Requiredverify_tls
: Whether to verify TLS for the object store endpoint (default: true) - Optionalendpoint
: The endpoint of the object store, e.g: https://s3.openshift-storage.svc:443 - Optionalregion
: The region of the object store - Optional
A similar operation can be performed for the evaluation judge model serving service. Currently, the script expects the Judge serving service to be running and accessible from within the cluster. If it is not present, the script will not create this resource.
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Secret
metadata:
name: judge-serving-details
namespace: my-namespace
type: Opaque
stringData:
JUDGE_API_KEY: ********
JUDGE_ENDPOINT: https://mistral-sallyom.apps.ocp-beta-test.nerc.mghpcc.org/v1
JUDGE_NAME: mistral
EOF
The list of all mandatory keys:
JUDGE_API_KEY
: The API key for the model to evaluate - RequiredJUDGE_ENDPOINT
: Serving endpoint for evaluation - RequiredJUDGE_NAME
: The name of the model to use for evaluation - Required
Warning
Mind the upper case of the keys, as the script expects them to be in upper case.
Alternatively, you can provide the necessary information directly via CLI options or environment,
the script will use the provided information to fetch the SDG data and create its own Kubernetes
Secret named sdg-object-store-credentials
in the same namespace as the resources it interacts with (in this case, my-namespace
).
export JUDGE_SERVING_MODEL_API_KEY=********
./standalone.py run \
--namespace my-namespace \
--judge-serving-model-endpoint http://serving.kubeflow.svc.cluster.local:8080/v1 \
--judge-serving-model-name my-model \
--sdg-object-store-access-key key \
--sdg-object-store-secret-key key \
--sdg-object-store-bucket sdg-data \
--sdg-object-store-data-key data.tar.gz
A judge model is assumed to be running external to the script. This is used for model evaluation.
- The
--judge-serving-model-endpoint
and--judge-serving-model-name
values will be stored in a ConfigMap namedjudge-serving-details
in the same namespace as the resources that the script interacts with. --judge-serving-model-api-key
or environment variableJUDGE_SERVING_MODEL_API_KEY
value will be stored in a secret namedjudge-serving-details
in the same namespace as the resources that the script interacts with.- In all examples, the
JUDGE_SERVING_MODEL_API_KEY
environment variable is exported rather than setting the CLI option.
If you don't use the official AWS S3 endpoint, you can provide additional information about the object store:
export JUDGE_SERVING_MODEL_API_KEY=********
./standalone.py run \
--namespace my-namespace \
--judge-serving-model-endpoint http://serving.kubeflow.svc.cluster.local:8080/v1 \
--judge-serving-model-name my-model \
--sdg-object-store-access-key key \
--sdg-object-store-secret-key key \
--sdg-object-store-bucket sdg-data \
--sdg-object-store-data-key data.tar.gz \
--sdg-object-store-verify-tls false \
--sdg-object-store-endpoint https://s3.openshift-storage.svc:443
Important
The --sdg-object-store-endpoint
option must be provided in the format
scheme://host:<port>
, the port can be omitted if it's the default port.
Tip
If you don't want to run the entire workflow and only want to fetch the SDG data, you can use the run sdg-data-fetch
command