Failed to train-text-from-scratch #4791
Having same problem here:
I have enabled alloc debugging (and removed the assertion so I could get the full trace):
Same here. No matter what the data size is or how I chunk it up, data that previously processed without trouble can no longer be processed. Fine-tuning works, so either something differs fundamentally between the references and headers of those two binaries, or something is broken in the "train" binary itself. This is what I see in a diff between the non-working version and the working version: diff train-text-from-scratch.cpp train-text-from-scratch.cpp.old
The issue is probably related to changes in ggml-alloc as explained in #3548 (comment). If you want a quick workaround until this is fixed, removing the calls to |
yes, commenting out the |
It is not a permanent fix; the leak would still need to be fixed. The correct fix would be to ensure that an allocator isn't freed while tensors allocated with it are still in use, which is the source of the issue.
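To illustrate the lifetime rule described above, here is a minimal sketch, not the actual ggml-alloc code. `Arena`, `Tensor`, `make_tensor`, and `sum_after_release` are hypothetical names invented for this example. The sketch assumes shared ownership is an acceptable way to keep a buffer alive; the real fix in the training code may look quite different.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for an allocator that owns one buffer.
// This is NOT the ggml-alloc API; it only illustrates the lifetime rule.
class Arena {
public:
    explicit Arena(std::size_t n_floats) : buf_(n_floats), used_(0) {}

    // Hand out a slice of the buffer; returns nullptr when the arena is full.
    float * alloc(std::size_t n) {
        if (n > buf_.size() - used_) {
            return nullptr;
        }
        float * p = buf_.data() + used_;
        used_ += n;
        return p;
    }

private:
    std::vector<float> buf_;
    std::size_t used_;
};

// Hypothetical tensor that shares ownership of the arena it was allocated from,
// so the buffer cannot be destroyed while the tensor still points into it.
struct Tensor {
    std::shared_ptr<Arena> owner;
    float * data;
    std::size_t n;
};

Tensor make_tensor(const std::shared_ptr<Arena> & arena, std::size_t n) {
    float * p = arena->alloc(n);
    assert(p != nullptr && "arena too small");
    return Tensor{arena, p, n};
}

// Allocates a tensor, drops the caller's arena handle early, then reads the tensor.
float sum_after_release() {
    auto arena = std::make_shared<Arena>(16);
    Tensor t = make_tensor(arena, 4);
    for (std::size_t i = 0; i < t.n; ++i) {
        t.data[i] = static_cast<float>(i + 1);
    }

    // Roughly analogous to freeing the allocator too early. Here it is safe,
    // because t.owner still holds a reference to the arena.
    arena.reset();

    float sum = 0.0f;
    for (std::size_t i = 0; i < t.n; ++i) {
        sum += t.data[i];
    }
    return sum;
}
```

With a raw pointer in place of `owner`, the read after `arena.reset()` would touch freed memory, which is the kind of use-after-free that can surface as a segfault. Simply never freeing the allocator avoids the crash but leaks the buffer, which is why that workaround is not a permanent fix.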
I've played a bit with this in between previous versions (pre-gguf) and the only (obvious) fallout seems to be that a segfault occurs when it can't detect which tensors are freed or not. The really neat thing about this program is that it usually faults after writing a checkpoint and model to disk, and both of those are usually recoverable for restarting training without losing progress. I should have figured this was a parallel issue. Thank you kindly for the help with this. |
* Fix issue with alloc causing max_compute_size to be calculated
* remove ggml_allocr_free as suggested in issue #4791
This issue is stale because it has been open for 30 days with no activity. |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
I failed to train-text-from-scratch. Can anyone help?