Support arm64 for Huggingface trainer #1986

Closed
johnugeorge opened this issue Jan 12, 2024 · 11 comments · Fixed by #2028

Comments

@johnugeorge
Member

Currently, arm64 support for the Hugging Face trainer image is removed due to limited resources in GitHub CI. It is to be re-enabled later, after further investigation.

@tenzen-y
Member

tenzen-y commented Mar 8, 2024

/good-first-issue


@tenzen-y:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.


@tariq-hasan
Contributor

I am interested in this issue.

I am wondering if this requires further triage before work can begin.

@tenzen-y
Member

tenzen-y commented Mar 8, 2024

> I am interested in this issue.
>
> I am wondering if this requires further triage before work can begin.

Feel free to assign this yourself with /assign.

@tariq-hasan
Contributor

/assign

@tariq-hasan
Contributor

tariq-hasan commented Mar 9, 2024

I presume that the job for the publication of the trainer-huggingface image failed in PR #1985.

PR #1987 appears to have fixed the issue by removing arm64 from the list of platforms that are supported for the image.

I was wondering how it was determined that arm64 is the root cause of the issue, as the logs are not descriptive in that regard.

System.IO.IOException: No space left on device : '/home/runner/runners/2.311.0/_diag/Worker_20240112-135835-utc.log'
   at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
   at System.IO.Strategies.BufferedFileStreamStrategy.FlushWrite()
   at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
   at System.Diagnostics.TextWriterTraceListener.Flush()
   at GitHub.Runner.Common.HostTraceListener.WriteHeader(String source, TraceEventType eventType, Int32 id)
   at GitHub.Runner.Common.HostTraceListener.TraceEvent(TraceEventCache eventCache, String source, TraceEventType eventType, Int32 id, String message)
   at System.Diagnostics.TraceSource.TraceEvent(TraceEventType eventType, Int32 id, String message)
   at GitHub.Runner.Worker.Worker.RunAsync(String pipeIn, String pipeOut)
   at GitHub.Runner.Worker.Program.MainAsync(IHostContext context, String[] args)
System.IO.IOException: No space left on device : '/home/runner/runners/2.311.0/_diag/Worker_20240112-135835-utc.log'
   at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
   at System.IO.Strategies.BufferedFileStreamStrategy.FlushWrite()
   at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
   at System.Diagnostics.TextWriterTraceListener.Flush()
   at GitHub.Runner.Common.HostTraceListener.WriteHeader(String source, TraceEventType eventType, Int32 id)
   at GitHub.Runner.Common.HostTraceListener.TraceEvent(TraceEventCache eventCache, String source, TraceEventType eventType, Int32 id, String message)
   at System.Diagnostics.TraceSource.TraceEvent(TraceEventType eventType, Int32 id, String message)
   at GitHub.Runner.Common.Tracing.Error(Exception exception)
   at GitHub.Runner.Worker.Program.MainAsync(IHostContext context, String[] args)
Unhandled exception. System.IO.IOException: No space left on device : '/home/runner/runners/2.311.0/_diag/Worker_20240112-135835-utc.log'
   at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
   at System.IO.Strategies.BufferedFileStreamStrategy.FlushWrite()
   at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
   at System.Diagnostics.TextWriterTraceListener.Flush()
   at System.Diagnostics.TraceSource.Flush()
   at GitHub.Runner.Common.TraceManager.Dispose(Boolean disposing)
   at GitHub.Runner.Common.TraceManager.Dispose()
   at GitHub.Runner.Common.HostContext.Dispose(Boolean disposing)
   at GitHub.Runner.Common.HostContext.Dispose()
   at GitHub.Runner.Worker.Program.Main(String[] args)

@tenzen-y
Member

tenzen-y commented Mar 9, 2024

> I presume that the job for the publication of the trainer-huggingface image failed in PR #1985.
>
> PR #1987 appears to have fixed the issue by removing arm64 from the list of platforms that are supported for the image.
>
> I was wondering how it was determined that arm64 is the root-cause of the issue as the logs do not appear to be descriptive in that regard.

That error was caused by the multi-arch image build, which uses more storage than a single-arch build. However, GitHub recently increased the computing resources available to OSS project CI: https://github.blog/2024-01-17-github-hosted-runners-double-the-power-for-open-source/
So, multi-arch image building should work now.
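
A pre-flight check could make this failure mode visible before the runner's own log writes start failing. The following is only a minimal sketch using Python's standard `shutil.disk_usage`; the helper names and the 60 GiB threshold are hypothetical and do not come from this thread or from the project's CI.

```python
import shutil

GIB = 1024 ** 3


def free_gib(path: str = "/") -> float:
    """Return the free space, in GiB, on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / GIB


def has_headroom(required_gib: float, path: str = "/") -> bool:
    """Return True if at least `required_gib` GiB are free at `path`."""
    return free_gib(path) >= required_gib


if __name__ == "__main__":
    # 60 GiB is a hypothetical threshold chosen for illustration only.
    required = 60.0
    print(f"free: {free_gib():.1f} GiB, enough for build: {has_headroom(required)}")
```

Running such a check as an early CI step would turn an opaque "No space left on device" crash into an explicit, readable failure.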

@tariq-hasan
Contributor

The following is the disk usage for amd64.

Disk usage before cleanup:
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        73G   53G   20G  73% /
tmpfs           7.9G  172K  7.9G   1% /dev/shm
tmpfs           3.2G  1.1M  3.2G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda15      105M  6.1M   99M   6% /boot/efi
/dev/sdb1        74G  4.1G   66G   6% /mnt
tmpfs           1.6G   12K  1.6G   1% /run/user/1001
Disk usage after cleanup:
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        73G   32G   42G  44% /
tmpfs           7.9G  172K  7.9G   1% /dev/shm
tmpfs           3.2G  1.1M  3.2G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda15      105M  6.1M   99M   6% /boot/efi
/dev/sdb1        74G  4.1G   66G   6% /mnt
tmpfs           1.6G   12K  1.6G   1% /run/user/1001
Disk usage after extracting parent image:
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        73G   59G   14G  81% /
tmpfs           7.9G  172K  7.9G   1% /dev/shm
tmpfs           3.2G  1.2M  3.2G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda15      105M  6.1M   99M   6% /boot/efi
/dev/sdb1        74G  4.1G   66G   6% /mnt
tmpfs           1.6G   12K  1.6G   1% /run/user/1001

The following is the disk usage for arm64.

Disk usage before cleanup:
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        73G   53G   20G  73% /
tmpfs           7.9G  172K  7.9G   1% /dev/shm
tmpfs           3.2G  1.1M  3.2G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda15      105M  6.1M   99M   6% /boot/efi
/dev/sdb1        74G  4.1G   66G   6% /mnt
tmpfs           1.6G   12K  1.6G   1% /run/user/1001
Disk usage after cleanup:
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        73G   32G   42G  44% /
tmpfs           7.9G  172K  7.9G   1% /dev/shm
tmpfs           3.2G  1.1M  3.2G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda15      105M  6.1M   99M   6% /boot/efi
/dev/sdb1        74G  4.1G   66G   6% /mnt
tmpfs           1.6G   12K  1.6G   1% /run/user/1001
Disk usage after extracting parent image:
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        73G   64G  9.4G  88% /
tmpfs           7.9G  172K  7.9G   1% /dev/shm
tmpfs           3.2G  1.2M  3.2G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda15      105M  6.1M   99M   6% /boot/efi
/dev/sdb1        74G  4.1G   66G   6% /mnt
tmpfs           1.6G   12K  1.6G   1% /run/user/1001

The issue has to do with the size of the downloaded parent image.
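
The arithmetic behind these figures can be sketched as follows. The `df` rows are copied from the output above (32G used after cleanup, 59G and 64G after extracting the parent image). The final comparison assumes that a multi-arch build would place both platforms' parent images on the same root filesystem without sharing any layers; that is an assumption for illustration, not something these logs confirm.

```python
def used_gib(df_line: str) -> float:
    """Parse the 'Used' column (e.g. '59G') from a `df -h` row."""
    fields = df_line.split()
    used = fields[2]
    if not used.endswith("G"):
        raise ValueError(f"expected a value in G, got {used!r}")
    return float(used[:-1])


# Rows for /dev/root taken from the logs in this comment.
AFTER_CLEANUP = "/dev/root        73G   32G   42G  44% /"
AFTER_EXTRACT = {
    "amd64": "/dev/root        73G   59G   14G  81% /",
    "arm64": "/dev/root        73G   64G  9.4G  88% /",
}

baseline = used_gib(AFTER_CLEANUP)

# Disk consumed by extracting the parent image, per architecture.
growth = {arch: used_gib(row) - baseline for arch, row in AFTER_EXTRACT.items()}

# 'Avail' column after cleanup.
available = 42.0

# Assumption: both images are stored side by side with no shared layers.
combined = sum(growth.values())

for arch, gib in growth.items():
    print(f"{arch}: parent image used about {gib:.0f}G")
print(f"combined: {combined:.0f}G needed vs {available:.0f}G available")
```

Under that assumption, each single-arch extraction fits within the space left after cleanup, while the two together would not.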

@tenzen-y
Member

tenzen-y commented Mar 9, 2024

@tariq-hasan No, I meant the multi-arch image. Your logs are from single-arch image builds.
Please refer to https://docs.docker.com/build/building/multi-platform/

@tariq-hasan
Contributor

tariq-hasan commented Mar 9, 2024

But I would think that even when using Docker Buildx with the Docker container driver for multi-platform builds, the base image is typically downloaded separately for each architecture.

I presume Docker Buildx creates a separate container for each platform specified in the build and each container runs the build process for the corresponding architecture.

So that would mean that the disk space would be used up even for a multi-architecture image build process.

Should we create a matrix of supported platforms so that we can distribute the execution across parallel runners and mitigate the issue with disk usage?
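
The matrix idea could be sketched as generating one job entry per platform. The helper below is hypothetical and only illustrates the shape of the data: GitHub Actions can read a JSON value like this into `strategy.matrix` through the `fromJSON` expression, so that each platform is built on its own runner. The platform strings are assumptions based on the two architectures discussed in this thread.

```python
import json


def build_matrix(platforms: list[str]) -> str:
    """Return a JSON string with one matrix entry per target platform."""
    return json.dumps({"include": [{"platform": p} for p in platforms]})


if __name__ == "__main__":
    # Platform list assumed from the architectures mentioned in this issue.
    print(build_matrix(["linux/amd64", "linux/arm64"]))
```

With this layout, each runner only has to hold one architecture's parent image, but the per-platform images would still need to be combined into a single multi-arch manifest in a later step.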

@tenzen-y
Member

> But I would think that even when using Docker Buildx with the Docker container driver for multi-platform builds, the base image is typically downloaded separately for each architecture.
>
> I presume Docker Buildx creates a separate container for each platform specified in the build and each container runs the build process for the corresponding architecture.
>
> So that would mean that the disk space would be used up even for a multi-architecture image build process.

That is correct, but when we faced the disk pressure issue, the increased runner resources had not yet been applied to the Kubeflow project.
(https://github.blog/2024-01-17-github-hosted-runners-double-the-power-for-open-source/)

Since then, the Actions runner resources have been increased, so I guess that we shouldn't face the same issue again.

Have you tried to run CI with multi-platform image building?

> Should we create a matrix of supported platforms so that we can distribute the execution across parallel runners and mitigate the issue with disk usage?

No, we shouldn't do that, for the reason described above.
