Merge pull request #45 from ROCm/memory_usage
[MinorFix] Add code to record memory usage
LiGuihong authored Jan 24, 2025
2 parents 0a77df7 + 56f285f commit e35b94b
Showing 1 changed file with 3 additions and 0 deletions.
3 changes: 3 additions & 0 deletions megatron/training/training.py
@@ -979,6 +979,9 @@ def training_log(loss_dict, total_loss_dict, learning_rate, decoupled_learning_r
                 args.skipped_train_samples)
         log_string += ' elapsed time per iteration (ms): {:.1f} |'.format(
             elapsed_time_per_iteration * 1000.0)
+        free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()
+        mem_usages = 1 - free_gpu_memory / total_gpu_memory
+        log_string += " mem usages: {:.4f} |".format(mem_usages)
         if args.log_throughput:
             log_string += f' throughput per GPU (TFLOP/s/GPU): {throughput:.1f} |'
             if args.log_timers_to_tensorboard:
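
The three added lines log the fraction of device memory currently in use, computed as 1 - free / total from the pair returned by `torch.cuda.mem_get_info()`. Below is a minimal standalone sketch of that calculation, not the code in the commit: the helper names `mem_usage_fraction`, `format_mem_usage`, and `current_mem_usage_string` are hypothetical and do not exist in Megatron. The arithmetic is kept separate from the `torch` call so it can be checked without a GPU.

```python
def mem_usage_fraction(free_bytes, total_bytes):
    """Return the fraction of device memory in use, as in the diff: 1 - free / total."""
    return 1 - free_bytes / total_bytes


def format_mem_usage(free_bytes, total_bytes):
    """Build the log fragment using the same format string as the added lines."""
    return " mem usages: {:.4f} |".format(
        mem_usage_fraction(free_bytes, total_bytes))


def current_mem_usage_string():
    """Query the current device and format the result (hypothetical helper).

    torch is imported lazily so the helpers above work without it installed.
    """
    import torch
    # mem_get_info() returns (free, total) in bytes for the current device.
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    return format_mem_usage(free_bytes, total_bytes)


if __name__ == "__main__":
    # Example with made-up numbers: 16 GiB free out of 64 GiB total.
    gib = 1024 ** 3
    print(format_mem_usage(16 * gib, 64 * gib))
```

Note that `torch.cuda.mem_get_info()` reports free and total memory for the whole device, so the logged fraction also counts memory held by other processes and by the PyTorch caching allocator, not only live tensors of the training process.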