Support arm64 for Hugging Face trainer #2028

tariq-hasan
Contributor

What this PR does / why we need it:

Following a suggestion from @tenzen-y, I have updated the CI workflow to build multi-platform images (linux/amd64 and linux/arm64) for the Hugging Face trainer.
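
For illustration, a multi-platform build in a GitHub Actions workflow might look like the fragment below. This is a minimal sketch and not the workflow changed in this PR: the step layout, the action versions, and the Dockerfile path are assumptions, and only the two platform names come from the description above.

```yaml
# Hypothetical job steps; not copied from the training-operator repository.
steps:
  - name: Check out the repository
    uses: actions/checkout@v4

  # QEMU provides emulation so an amd64 runner can build arm64 layers.
  - name: Set up QEMU
    uses: docker/setup-qemu-action@v3

  - name: Set up Docker Buildx
    uses: docker/setup-buildx-action@v3

  - name: Build the trainer image for both platforms
    uses: docker/build-push-action@v5
    with:
      context: .
      # Hypothetical path; substitute the real Dockerfile location.
      file: path/to/trainer/Dockerfile
      platforms: linux/amd64,linux/arm64
      push: false
```

Building under emulation for a second platform roughly doubles the layers kept on the runner, which is why disk usage becomes a concern.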

The code changes follow the approach taken in kserve/kserve#3411.

In particular, Docker images are pruned and the Docker data directory is moved from the root volume to the /mnt volume.

The resulting reduction in disk space usage in the root volume helps the CI workflow to complete without errors.
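
The cleanup described above could be sketched as the workflow steps below. The exact commands used in this PR and in kserve/kserve#3411 are not shown in this conversation, so treat this as one possible approach under stated assumptions: /var/lib/docker is Docker's default data directory on Linux, while the /mnt/docker target and the symlink technique are assumptions chosen for the example.

```yaml
# Hypothetical job steps; a sketch, not the steps merged in this PR.
steps:
  - name: Prune Docker images
    shell: bash
    run: |
      echo "Disk usage before cleanup:"
      df -h
      # Remove every image that is not used by an existing container.
      docker image prune -a -f
      echo "Disk usage after cleanup:"
      df -h

  - name: Move the Docker data directory to /mnt
    shell: bash
    run: |
      # Stop the daemon so nothing writes to the directory during the move.
      sudo systemctl stop docker
      # /mnt/docker is an assumed target on the larger /mnt volume.
      sudo mv /var/lib/docker /mnt/docker
      # Keep the default path working by pointing it at the new location.
      sudo ln -s /mnt/docker /var/lib/docker
      sudo systemctl start docker
```

An alternative to the symlink is setting `data-root` in /etc/docker/daemon.json, which avoids relying on the default path but requires merging with any daemon configuration already present on the runner.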

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #1986

Checklist:

  • Docs included if any changes are user facing

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>
@tariq-hasan force-pushed the support-arm64-for-huggingface-trainer branch from 55946f9 to e7f4623 (March 13, 2024 10:55)
@tenzen-y
Member

@tariq-hasan Great work!
@kubeflow/wg-training-leads Could you approve CI?

Member

@tenzen-y left a comment

/lgtm
/approve
/hold

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tariq-hasan, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coveralls

Pull Request Test Coverage Report for Build 8263475761

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.09%) to 42.908%

Totals Coverage Status
Change from base Build 8248109561: 0.09%
Covered Lines: 3757
Relevant Lines: 8756

💛 - Coveralls

@johnugeorge
Member

Thanks for this.
/hold cancel

@google-oss-prow google-oss-prow bot merged commit 8433edc into kubeflow:master Mar 13, 2024
37 checks passed
@tariq-hasan tariq-hasan deleted the support-arm64-for-huggingface-trainer branch March 18, 2024 02:36
tedhtchang pushed a commit to tedhtchang/training-operator that referenced this pull request Apr 5, 2024
* echoed disk usage before cleanup

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>

* pruned docker images

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>

* moved docker data directory

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>

* added arm64 in the list of platforms for trainer-huggingface

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>

---------

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>
deepanker13 pushed a commit to deepanker13/deepanker-training-operator that referenced this pull request Apr 8, 2024
Signed-off-by: deepanker13 <deepanker.gupta@nutanix.com>
johnugeorge pushed a commit to johnugeorge/training-operator that referenced this pull request Apr 28, 2024
johnugeorge pushed a commit to johnugeorge/training-operator that referenced this pull request Apr 28, 2024
Development

Successfully merging this pull request may close these issues.

Support arm64 for Huggingface trainer
4 participants