Introduce `truss train` subcommand, API stubs #1422

nnarayen · 2025-03-04T21:25:22Z

🚀 What

This PR adds the foundation for new truss_train functionality:

truss train push CLI stub to create a training project + job
framework code to import + validate training job definitions
pydantic types (not finalized)

💻 How

🔬 Testing

New unit tests
Sample truss train push with the following config:

from truss_train import definitions

runtime_config = definitions.RuntimeConfig(
    start_commands=["/bin/bash ./my-entrypoint.sh"],
    environment_variables={
        "FOO_VAR": "FOO_VAL",
        "BAR_VAR": definitions.SecretReference(name="BAR_SECRET"),
    },
)

training_job = definitions.TrainingJob(
    compute=definitions.Compute(node_count=1, cpu_count=4),
    runtime_config=runtime_config,
)

first_project = definitions.TrainingProject(name="first-project", job=training_job)

nnarayen · 2025-03-04T21:27:19Z

pyproject.toml

@@ -18,6 +18,7 @@ keywords = [
 packages = [
    { include = "truss", from = "." },
    { include = "truss_chains", from = "./truss-chains" },
+    { include = "truss_train", from = "./truss-train" },


@marius-baseten curious why this package structure was needed for chains? I think we can still use the import aliasing here without this, but I feel like there are other benefits I'm not aware of

It was not strictly needed, but since the truss subtree has imports all over the place and chains had some different version requirements, this extra dir was a salient way to keep some isolation/structure between the two subtrees.

nnarayen · 2025-03-04T21:29:09Z

pyproject.toml

@@ -189,7 +190,7 @@ markers = [
 addopts = "--ignore=smoketests"

 [tool.ruff]
-src = ["truss", "truss-chains"]
+src = ["truss", "truss-chains", "truss-train"]


This allows truss-train imports to be labeled as first-party for import sorting. Note for followup: I think we actually want to specify `.` instead of `truss` here, since right now truss is being categorized as third-party across the board.

nnarayen · 2025-03-04T21:30:03Z

truss-chains/truss_chains/definitions.py

@@ -27,6 +27,7 @@
 import pydantic
 from truss.base import truss_config
 from truss.base.constants import PRODUCTION_ENVIRONMENT_NAME
+from truss.base.custom_types import SafeModel, SafeModelNonSerializable


There might be some shared paradigms between chains / train that aren't applicable to traditional truss. For now I'm opting to put into base, but open to suggestions on a better file structure

This code requires pydantic v2, which was selectively required for chains only: https://github.com/basetenlabs/truss/blob/main/truss-chains/truss_chains/__init__.py#L3

Are we ready now to require v2 for the entire truss package? If so, we should remove those guards, update the project requirements etc...

I ended up reverting this for other reasons, but I think we can't upgrade to pydantic v2 until TaT ships, since we might break older trusses that were built with v1, unfortunately.

nnarayen · 2025-03-04T21:33:04Z

truss-train/truss_train/definitions.py

+    name: str
+
+
+class Compute(SafeModel):


I don't think we need the extra layer like chains, and I'm not sure chains needs it either?

The layer was mainly for dealing with the conversion of gpu, but I believe this could be done with pydantic validators too.

One thing that is really important here: try to find a common denominator for the "new-style" APIs, so that we can re-use this for chains. In a way I'd even suggest putting these defs into truss.base, and then both train and chains can depend on it and use the same definitions. It's a really bad user experience if these two products have "almost" the same APIs but with subtle differences, plus it's more upkeep and repo clutter if we have so many definitions.

It was discussed in the API design/PRD doc to strive for this consolidation.

I'm happy to consolidate the definitions for compute, but I think after our simplifications that's the only one that makes sense? Unfortunately I believe env variables / secrets are different enough to warrant new APIs.

Chains can register secrets to be made available to user code, but there is no similar entrypoint in training. I think the best UX would be to have a clean way to define both traditional env vars and ones derived from secrets (as proposed here), and then the user can consume via bash / python scripts as needed.
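To make the proposal above concrete, here is a minimal sketch of how a mixed mapping of plain values and secret references could be resolved into ordinary environment variables before the start commands run. It uses stdlib dataclasses as stand-ins for the pydantic types; `resolve_environment` and the `secrets` lookup are hypothetical and not part of this PR, and where the resolution actually happens (client or server) is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Union


@dataclass(frozen=True)
class SecretReference:
    """Stand-in for definitions.SecretReference: points at a secret by name."""

    name: str


EnvValue = Union[str, SecretReference]


def resolve_environment(
    env: Mapping[str, EnvValue], secrets: Mapping[str, str]
) -> Dict[str, str]:
    """Hypothetical helper: replace secret references with their stored values."""
    resolved: Dict[str, str] = {}
    for key, value in env.items():
        if isinstance(value, SecretReference):
            if value.name not in secrets:
                raise KeyError(f"Secret {value.name!r} (used by {key}) is not defined")
            resolved[key] = secrets[value.name]
        else:
            # Plain values pass through unchanged.
            resolved[key] = value
    return resolved


env = resolve_environment(
    {"FOO_VAR": "FOO_VAL", "BAR_VAR": SecretReference(name="BAR_SECRET")},
    secrets={"BAR_SECRET": "s3cr3t"},
)
```

The user script then only ever sees `FOO_VAR` and `BAR_VAR` as regular environment variables, regardless of where the value came from.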

nnarayen · 2025-03-04T21:49:33Z

truss/remote/baseten/api.py

@@ -566,3 +567,27 @@ def get_all_secrets(self) -> Any:

        secrets_info = resp.json()
        return secrets_info
+
+    def upsert_training_project(self, training_project):


I personally think the REST API structure should mirror the truss SDK very closely. The server code is already going to have to transform things (e.g. get the actual instance type, worker planes, user/org, etc.), but we can keep the truss integration simple (basically a model_dump() here instead of explicit transformation code).
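A minimal sketch of the "mirror the SDK" idea, using stdlib dataclasses and `dataclasses.asdict` as a stand-in for pydantic's `model_dump()`. Field names follow the sample config in the PR description; the `build_create_job_payload` helper and the `training_job` payload key are hypothetical and not the actual API contract.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class Compute:
    node_count: int = 1
    cpu_count: int = 1


@dataclass
class RuntimeConfig:
    start_commands: List[str] = field(default_factory=list)


@dataclass
class TrainingJob:
    compute: Compute
    runtime_config: RuntimeConfig


def build_create_job_payload(job: TrainingJob) -> Dict[str, Any]:
    """Hypothetical helper: the request body is just the dumped SDK object."""
    # With pydantic v2 models this line would be job.model_dump().
    return {"training_job": asdict(job)}


payload = build_create_job_payload(
    TrainingJob(
        compute=Compute(node_count=1, cpu_count=4),
        runtime_config=RuntimeConfig(start_commands=["/bin/bash ./my-entrypoint.sh"]),
    )
)
```

Because the payload shape is derived from the model, adding a field to the SDK type changes the request body without any hand-written mapping code on the client.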

nnarayen · 2025-03-04T21:50:23Z

truss/remote/baseten/api.py

+
+        return resp.json()
+
+    def create_training_job(self, project_id: str, job):


Here (and in the remote wrapper) I ran into a circular dep issue when trying to annotate TrainingProject / TrainingJob because truss.base ends up depending on the api. I'm sure we can disentangle this in a followup, but I don't think it needs to block for now
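One common way to keep type annotations without a runtime import cycle is a `typing.TYPE_CHECKING` guard plus string annotations. This is only a sketch of that general technique, not necessarily what the followup will do; `ApiSketch` is a hypothetical stand-in for the API wrapper.

```python
from typing import TYPE_CHECKING, Any, Dict

if TYPE_CHECKING:
    # Evaluated by type checkers only. At runtime TYPE_CHECKING is False,
    # so this import never executes and cannot take part in an import cycle.
    from truss_train import definitions


class ApiSketch:
    """Hypothetical stand-in for the API wrapper in truss/remote/baseten/api.py."""

    def create_training_job(
        self, project_id: str, job: "definitions.TrainingJob"
    ) -> Dict[str, Any]:
        # The string annotation is never evaluated at runtime, so the name
        # `definitions` does not need to exist when this module is imported.
        return {"project_id": project_id, "job_type": type(job).__name__}


result = ApiSketch().create_training_job("project-123", object())
```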

nnarayen · 2025-03-04T21:57:59Z

truss-train/truss_train/definitions.py

+    accelerator: Optional[truss_config.AcceleratorSpec] = None
+
+
+class Runtime(SafeModel):


We can add client side validation code to these models in follow ups as well

nnarayen · 2025-03-04T21:58:58Z

truss-train/truss_train/loader.py

+from truss_train import definitions
+
+
+@contextlib.contextmanager


I know we just went over this in the offsite, but I actually think this code is different enough from chains to warrant the duplication, thoughts? Differences include:

no module modifications needed

targets aren't subclasses, but actual instances of TrainingProject

error messages

I think this looks pretty good, and I agree with the thought you put into the decision. Could you explain a little bit about what the context manager helps with here?

In practice I don't think it's super necessary, was following the pattern from chains. The context manager allows us to defer execution to the block but ensure that we always perform cleanup code. Chains has more complicated import logic that tries to clean up modified modules, but we can probably get away without it for now
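The "defer execution to the block, always run cleanup" behavior can be sketched with the stdlib alone. This is an illustrative version, not the loader from this PR: the `target_type` parameter and the exactly-one-instance rule are assumptions, and a tuple stands in for `TrainingProject` in the demo.

```python
import contextlib
import importlib.util
import pathlib
import sys
import tempfile
from typing import Any, Iterator, Type


@contextlib.contextmanager
def import_target(module_path: pathlib.Path, target_type: Type[Any]) -> Iterator[Any]:
    """Import module_path and yield its single module-level target_type instance."""
    module_name = module_path.stem
    spec = importlib.util.spec_from_file_location(module_name, module_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Could not load a module from {module_path}")
    module = importlib.util.module_from_spec(spec)
    parent_dir = str(module_path.parent)
    sys.modules[module_name] = module
    sys.path.insert(0, parent_dir)  # lets the config import sibling files
    try:
        spec.loader.exec_module(module)
        targets = [
            value
            for name, value in vars(module).items()
            if not name.startswith("_") and isinstance(value, target_type)
        ]
        if len(targets) != 1:
            raise ValueError(
                f"Expected exactly one {target_type.__name__} in {module_path}, "
                f"found {len(targets)}"
            )
        yield targets[0]
    finally:
        # Runs whether the with-block finished normally or raised.
        sys.modules.pop(module_name, None)
        with contextlib.suppress(ValueError):
            sys.path.remove(parent_dir)


# Demo with a throwaway config file.
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = pathlib.Path(tmp_dir) / "my_training_config.py"
    config_path.write_text("project = ('first-project',)\n")
    with import_target(config_path, tuple) as loaded:
        visible_inside = "my_training_config" in sys.modules
    visible_after = "my_training_config" in sys.modules
```

The module is importable only while the `with` block is active; afterwards `sys.modules` and `sys.path` are restored even if the block raised.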

nnarayen · 2025-03-04T22:00:41Z

truss-train/truss_train/definitions.py

+    runtime: Runtime = Runtime()
+
+
+class TrainingProject(SafeModel):


I intentionally left off the blob stuff for now, I think it'll be easier to add that in targeted followups. I imagine we'll eventually zip up the directory and pass it through to server to upload to an S3 bucket of our choosing

rcano-baseten

LGTM! Interested to get Marius's thoughts, but I think this is pretty much in line with what I was thinking.

rcano-baseten · 2025-03-05T04:11:51Z

truss-train/truss_train/loader.py

+from truss_train import definitions
+
+
+@contextlib.contextmanager


I think this looks pretty good, and I agree with the thought you put into the decision. Could you explain a little bit about what the context manager helps with here?

rcano-baseten · 2025-03-05T04:12:33Z

truss/cli/cli.py

+
+
+@train.command(name="push")
+@click.argument("source", type=Path, required=True)


reason for source instead of config? I think we're closer to a config than any source code...

No strong reason, mainly (1) consistency with chains terminology and (2) future-proofing in case we extend this to support user-written code. I'll switch to config for now, since this is easy to change until we show it to customers!

marius-baseten · 2025-03-05T18:16:38Z

pyproject.toml

@@ -18,6 +18,7 @@ keywords = [
 packages = [
    { include = "truss", from = "." },
    { include = "truss_chains", from = "./truss-chains" },
+    { include = "truss_train", from = "./truss-train" },


It was not strictly needed, but since the truss subtree has imports all over the place and chains had some different version requirements, this extra dir was a salient way to keep some isolation/structure between the two subtrees.

marius-baseten · 2025-03-05T18:20:20Z

truss-chains/truss_chains/definitions.py

@@ -27,6 +27,7 @@
 import pydantic
 from truss.base import truss_config
 from truss.base.constants import PRODUCTION_ENVIRONMENT_NAME
+from truss.base.custom_types import SafeModel, SafeModelNonSerializable


This code requires pydantic v2, which was selectively required for chains only: https://github.com/basetenlabs/truss/blob/main/truss-chains/truss_chains/__init__.py#L3

Are we ready now to require v2 for the entire truss package? If so, we should remove those guards, update the project requirements etc...

marius-baseten · 2025-03-05T18:25:27Z

truss-train/truss_train/definitions.py

+    name: str
+
+
+class Compute(SafeModel):


The layer was mainly for dealing with the conversion of gpu, but I believe this could be done with pydantic validators too.

One thing that is really important here: try to find a common denominator for the "new-style" APIs, so that we can re-use this for chains. In a way I'd even suggest putting these defs into truss.base, and then both train and chains can depend on it and use the same definitions. It's a really bad user experience if these two products have "almost" the same APIs but with subtle differences, plus it's more upkeep and repo clutter if we have so many definitions.

It was discussed in the API design/PRD doc to strive for this consolidation.
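As a rough illustration of folding the GPU conversion into the model itself, the sketch below normalizes the accelerator field at construction time. It uses a stdlib dataclass with `__post_init__` as a stand-in for a pydantic validator; `AcceleratorSpec` here is a simplified hypothetical substitute for `truss_config.AcceleratorSpec`, and the `"NAME:COUNT"` string form is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class AcceleratorSpec:
    """Simplified, hypothetical stand-in for truss_config.AcceleratorSpec."""

    accelerator: str
    count: int = 1


def parse_accelerator(
    value: Union[str, AcceleratorSpec, None],
) -> Optional[AcceleratorSpec]:
    """Accept None, a spec, or a 'NAME' / 'NAME:COUNT' string (assumed format)."""
    if value is None or isinstance(value, AcceleratorSpec):
        return value
    name, _, count = value.partition(":")
    if not name:
        raise ValueError(f"Invalid accelerator string: {value!r}")
    return AcceleratorSpec(accelerator=name, count=int(count) if count else 1)


@dataclass
class Compute:
    node_count: int = 1
    cpu_count: int = 1
    accelerator: Union[str, AcceleratorSpec, None] = None

    def __post_init__(self) -> None:
        # In a pydantic model, a "before" field validator would do this step.
        self.accelerator = parse_accelerator(self.accelerator)


gpu_compute = Compute(node_count=1, cpu_count=4, accelerator="A10G:2")
```

With the conversion living on the field, a single shared `Compute` definition could serve both train and chains without an extra wrapper layer.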

marius-baseten · 2025-03-05T18:31:55Z

truss-train/truss_train/definitions.py

+
+
+class TrainingJob(SafeModel):
+    image: Image


This is not following the PRD / Design doc - why the deviation?

The design had a separation of defining an image as a semi-permanent resource and then referencing that by an ID (not nesting the definition) in the training job.

marius-baseten · 2025-03-05T18:33:38Z

truss/base/custom_types.py

+class SafeModelNonSerializable(pydantic.BaseModel):
+    """Pydantic base model with reasonable config - allowing arbitrary types."""
+
+    model_config = pydantic.ConfigDict(


See other comment, this requires pydantic v2 in the entire truss package now.

marius-baseten · 2025-03-05T18:40:34Z

truss-train/truss_train/loader.py

+
+
+@contextlib.contextmanager
+def import_target(module_path: pathlib.Path) -> Iterator[definitions.TrainingProject]:


I never really liked that we have to import stuff from a path dynamically for chains and truss. Since training is a completely fresh product, is there a way we can avoid those brittle imports and use dependency injection or something like that?

For truss the problem is that it essentially works like this:

class TrussServer:
    def run(self):
        import_from_path(user_module_path)

if __name__ == "__main__":
    TrussServer().run()

The better pattern would be:

class UserStuff:
    ...

if __name__ == "__main__":
    TrussServer(UserStuff).run()

Synced offline - since this is purely on the CLI side for now, there's unfortunately no great way around this. Can revisit if this ever becomes runtime code!

nnarayen force-pushed the nikhil/introduce-truss-train branch 2 times, most recently from 536a1f7 to c81c767 on March 4, 2025 21:40

nnarayen force-pushed the nikhil/introduce-truss-train branch from c81c767 to 670b6ff on March 4, 2025 21:49

Introduce truss train subcommand, API stubs

5567716

nnarayen force-pushed the nikhil/introduce-truss-train branch from 670b6ff to 5567716 on March 4, 2025 21:57

nnarayen requested review from marius-baseten and rcano-baseten March 4, 2025 21:59

rcano-baseten approved these changes Mar 5, 2025

Address comments

6eec49c

nnarayen merged commit a3a471e into main Mar 5, 2025
5 checks passed

nnarayen deleted the nikhil/introduce-truss-train branch March 5, 2025 15:14

nnarayen mentioned this pull request Mar 5, 2025

Fix truss integration tests #1425

Merged

marius-baseten reviewed Mar 5, 2025

nnarayen mentioned this pull request Mar 5, 2025

Support local truss source code, refactor common pydantic types #1427

Merged
