add code to record memory usage
LiGuihong committed Jan 13, 2025
1 parent bb93ccb commit 56f285f
Showing 1 changed file with 3 additions and 0 deletions.
3 changes: 3 additions & 0 deletions megatron/training/training.py
@@ -979,6 +979,9 @@ def training_log(loss_dict, total_loss_dict, learning_rate, decoupled_learning_r
             args.skipped_train_samples)
         log_string += ' elapsed time per iteration (ms): {:.1f} |'.format(
             elapsed_time_per_iteration * 1000.0)
+        free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()
+        mem_usages = 1 - free_gpu_memory / total_gpu_memory
+        log_string += " mem usages: {:.4f} |".format(mem_usages)
         if args.log_throughput:
             log_string += f' throughput per GPU (TFLOP/s/GPU): {throughput:.1f} |'
         if args.log_timers_to_tensorboard:
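
The three added lines compute the logged value as `1 - free / total`, using the pair returned by `torch.cuda.mem_get_info()`. A minimal standalone sketch of that arithmetic follows; the helper names (`used_memory_fraction`, `format_mem_usage`, `query_gpu_memory`) are hypothetical and not part of the commit, and the `torch` import is deferred so the formatting logic can be checked on a machine without a GPU.

```python
def used_memory_fraction(free_bytes: int, total_bytes: int) -> float:
    """Fraction of device memory in use, computed as in the commit: 1 - free / total."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 1.0 - free_bytes / total_bytes


def format_mem_usage(free_bytes: int, total_bytes: int) -> str:
    """Build the log fragment with the same format string the commit appends."""
    return " mem usages: {:.4f} |".format(
        used_memory_fraction(free_bytes, total_bytes))


def query_gpu_memory():
    """Return (free, total) in bytes for the current CUDA device, or None.

    Hypothetical helper: returns None when torch or a CUDA device is missing.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.mem_get_info()


if __name__ == "__main__":
    info = query_gpu_memory()
    if info is not None:
        print(format_mem_usage(*info))
    else:
        # Illustrative numbers only: 20 GiB free out of 80 GiB.
        print(format_mem_usage(20 * 2**30, 80 * 2**30))
```

Note that `torch.cuda.mem_get_info()` reports device-wide free and total memory, so the logged fraction also counts memory held by other processes on the same GPU and memory reserved (but not currently allocated) by PyTorch's caching allocator. If per-process figures are wanted instead, `torch.cuda.memory_allocated()` or `torch.cuda.max_memory_allocated()` may be closer to the intent.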