
CUDA Memory Profile Analyzer #9860

Closed
wants to merge 13 commits

Conversation


@tonyjie tonyjie commented Jul 24, 2024

What does this PR do?

  • Collects a CUDA memory snapshot, building on the previous commit (CUDA memory profile #9096), and further analyzes which parts of the model contribute to the total memory footprint.
  • The memory profiler generates two pickle files, one for weights and one for activations. The user can load each file at https://pytorch.org/memory_viz
  • If out-of-memory (CUDA OOM) occurs, the tool captures the snapshot just before the OOM and generates a pickle file.
  • With the knob analysis_enabled: true, the memory profile analyzer generates two csv files for each of weight/activation/OOM. The output csv files include:
    1. Weight
      • alive_memory_weight.csv
      • group_by_alloc_frames_weight.csv
    2. Activation
      • alive_memory_memory.csv
      • group_by_alloc_frames_memory.csv
    3. OOM
      • alive_memory_oom.csv
      • group_by_alloc_frames_oom.csv

Changelog

  • Fix some issues with the previous memory profiler:
    • batch_idx mismatch issue.
    • max_entries was too small, which made the generated snapshot easily truncated.
  • Add weight memory capturing.
  • Add OOM case support.
  • Add the option to run further analysis on the generated memory snapshot file. The analyzer finds the peak memory of the snapshot and generates two csv files:
    1. All the alive memory buffers at that peak moment
    2. The same buffers grouped by allocation frames, showing the relationship between each model layer and its corresponding memory footprint.
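The two analysis steps above can be sketched in plain Python (a minimal, hypothetical sketch over a synthetic trace; the real analyzer in nemo/utils/memory_profile_analyzer.py operates on PyTorch's snapshot format, and the event names and tuple layout here are assumptions):

```python
from collections import defaultdict

def find_peak(trace):
    """Replay alloc/free events and return (peak_bytes, peak_time, alive_at_peak)."""
    alive = {}  # addr -> (size, frames)
    total = peak = 0
    peak_time, peak_alive = None, {}
    for time_us, addr, action, size, frames in trace:
        if action == "alloc":
            alive[addr] = (size, frames)
            total += size
            if total > peak:
                peak, peak_time, peak_alive = total, time_us, dict(alive)
        elif action == "free_completed":
            total -= alive.pop(addr)[0]
    return peak, peak_time, peak_alive

def group_by_frames(alive):
    """Sum alive bytes per allocation frame (e.g. per model layer)."""
    by_frame = defaultdict(int)
    for size, frames in alive.values():
        by_frame[frames] += size
    return dict(by_frame)

# Synthetic trace: (time_us, addr, action, size, frames)
trace = [
    (0, 0xA, "alloc", 100, "layer1"),
    (1, 0xB, "alloc", 300, "layer2"),
    (2, 0xA, "free_completed", 100, "layer1"),
    (3, 0xC, "alloc", 50, "layer2"),
]
peak, t, alive = find_peak(trace)
print(peak, t, group_by_frames(alive))  # 400 1 {'layer1': 100, 'layer2': 300}
```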

Usage

  • Add the knobs below to the YAML run config.
# Memory Profile
memory_profile:
   enabled: true
   start_step: 1
   end_step: 3
   rank: 0
   output_path: <path/to/out_file>
   analysis_enabled: true

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and re-add the label.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

@github-actions github-actions bot added the core Changes to NeMo Core label Jul 24, 2024
@tonyjie tonyjie marked this pull request as draft July 24, 2024 05:38
@tonyjie tonyjie marked this pull request as ready for review July 24, 2024 05:39
Collaborator:
this file needs a copyright header

@ericharper ericharper requested review from akoumpa and titu1994 July 31, 2024 21:35
@@ -204,6 +206,8 @@ def __init__(self, cfg: DictConfig, trainer: Trainer = None):

# Setup nsys profiling if it has been enabled in the model config
self._setup_profiling()
# Track the real, accurate _batch_idx ourselves. We found that the `batch_idx` passed to `on_train_batch_start` and `on_train_batch_end` has a bug.
Member:
Hi, can you expand what bug was found?
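For context, a common way to sidestep an unreliable hook-supplied batch_idx is to keep an independent counter incremented in the batch-start hook (a hypothetical sketch, not this PR's exact code; the class and attribute names are illustrative). It also shows why an == comparison against start_step fires exactly once:

```python
class BatchCounter:
    """Maintain an independent batch index instead of trusting the hook's batch_idx."""
    def __init__(self, start_step, end_step):
        self._real_batch_idx = -1
        self.start_step, self.end_step = start_step, end_step
        self.profiling = False

    def on_train_batch_start(self):
        self._real_batch_idx += 1
        if self._real_batch_idx == self.start_step:
            self.profiling = True   # would call torch.cuda.memory._record_memory_history(...)

    def on_train_batch_end(self):
        if self._real_batch_idx == self.end_step:
            self.profiling = False  # would dump the snapshot and stop recording

c = BatchCounter(start_step=1, end_step=3)
states = []
for _ in range(5):
    c.on_train_batch_start()
    states.append(c.profiling)
    c.on_train_batch_end()
print(states)  # [False, True, True, True, False]
```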

@@ -49,6 +49,8 @@
from nemo.utils.debug_hook import register_debug_hooks
from nemo.utils.exceptions import NeMoBaseException
from nemo.utils.get_rank import get_rank, is_global_rank_zero
# from nemo.utils.memory_profile_analyzer import peak_memory_analysis_activation, peak_memory_analysis_weight, peak_memory_analysis_oom
Member:
is the commented line needed?

logging.info(f"===== Memory Profile Analysis: OOM ======")
peak_memory_analysis(self._memory_profile_snapshot_file_oom, self._memory_profile_analysis_path, 'oom', self._memory_profile_rank)
else:
raise Exception(f"Snapshot file not found: {self._memory_profile_snapshot_file_oom}")
Member:
Maybe move this after line 1833 torch.cuda.memory._dump_snapshot?

return

# Call the analysis function
if self._memory_profile_analysis_enabled:
Member:
do you need this if? I would assume _memory_profile_analysis_enabled does not change value, and on line 1880 you already check whether it's true or not.

logging.info(f"====== Memory Profile Analysis: Weight ======")
peak_memory_analysis(self._memory_profile_snapshot_file_weight, self._memory_profile_analysis_path, 'weight', self._memory_profile_rank)
else:
raise Exception(f"Snapshot file not found: {self._memory_profile_snapshot_file_weight}")
if batch_idx >= self._memory_profile_start_step and get_rank() == self._memory_profile_rank:
logging.info("====== Start CUDA memory profiling ======")
torch.cuda.memory._record_memory_history(max_entries=100000)
if self._real_batch_idx == self._memory_profile_start_step and get_rank() == self._memory_profile_rank:
Member:
why is it self._real_batch_idx == self._memory_profile_start_step instead of self._real_batch_idx >= self._memory_profile_start_step ?

logging.info("====== End nsys profiling ======")
torch.cuda.cudart().cudaProfilerStop()
self._nsys_profile_complete = True

if hasattr(self, '_memory_profile_enabled'):
if self._memory_profile_enabled and not self._memory_profile_complete:
if batch_idx >= self._memory_profile_end_step and get_rank() == self._memory_profile_rank:
logging.info("====== End CUDA memory profiling ======")
if self._real_batch_idx == self._memory_profile_end_step and get_rank() == self._memory_profile_rank:
)
torch.cuda.memory._record_memory_history(enabled=None)
self._memory_profile_complete = True
# Call the analysis function
if self._memory_profile_analysis_enabled and self._memory_profile_complete:
Member:
same as previously: self._memory_profile_analysis_enabled should already be true here, due to the check on line 1890

Add `\n` in between each frame for the readability.
"""
# Prune Frames
after_prune_frames = [prune_frames(x[3]) for x in alive_memory]
@akoumpa (Member) commented Aug 1, 2024:
what's x[3]? can you add a comment?
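For illustration, if each alive_memory entry were a tuple whose index 3 holds the allocation call stack, prune_frames(x[3]) would flatten and filter that stack (a hypothetical sketch; the real entry layout and pruning rules may differ):

```python
def prune_frames(frames):
    """Drop framework-internal frames and join the rest with '\n' for readability.
    Assumes each frame is a dict with a 'name' key, as in PyTorch snapshots."""
    names = [f["name"] for f in frames]
    return "\n".join(n for n in names if not n.startswith("torch/"))

# x[3] is assumed to be the frames list of an alive-memory entry:
# (addr, size, time_us, frames)
alive_memory = [
    (0xA, 1024, 17, [{"name": "MyModel.forward"}, {"name": "torch/nn/functional"}]),
]
after_prune_frames = [prune_frames(x[3]) for x in alive_memory]
print(after_prune_frames)  # ['MyModel.forward']
```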



# ===== Function: for two time points, check the corresponding alive memory, and compare them to see: what's new, what's gone, what's unchanged.
def compare_alive_memory(tracker, time_us_1, time_us_2):
Member:
can you add a couple tests for this?
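A couple of tests along these lines could exercise the diff logic; this hedged sketch assumes the core of compare_alive_memory reduces to diffing two addr-to-size maps (the real function takes a tracker and two timestamps and derives these maps itself):

```python
def compare_alive(alive_1, alive_2):
    """Diff two addr -> size maps: what's new, what's gone, what's unchanged."""
    new = {a: s for a, s in alive_2.items() if a not in alive_1}
    gone = {a: s for a, s in alive_1.items() if a not in alive_2}
    unchanged = {a: s for a, s in alive_1.items() if a in alive_2}
    return new, gone, unchanged

# test: buffer B freed, C allocated, A persists between the two time points
t1 = {0xA: 100, 0xB: 300}
t2 = {0xA: 100, 0xC: 50}
new, gone, unchanged = compare_alive(t1, t2)
assert new == {0xC: 50} and gone == {0xB: 300} and unchanged == {0xA: 100}
print("ok")
```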

return frame
return None

def alloc_memory_timeline(trace):
Member:
if that finds the maximum/minimum alloc memory and the corresponding timestamps, it would be helpful if that were reflected in the name.

for idx, timepoint in enumerate(trace):
(time_us, addr, action, size, frames, stream) = read_tp(timepoint)

if (action == "alloc"):
Member:
you don't need the parentheses.




def record_alloc_memory_timeline(trace):
Member:
this looks very similar to alloc_memory_timeline; can you refactor to reduce duplicated code?

@akoumpa (Member) left a comment:
Thanks, just a few minor comments, this looks great overall.

@akoumpa (Member) commented Aug 1, 2024

One more request @tonyjie: can you rebase to the latest main, use --signoff on your commits, and push again? Otherwise CI won't run.

nemo/utils/memory_profile_analyzer.py: 4 alerts fixed
@tonyjie tonyjie force-pushed the jiajiel/mem_snapshot_pr1 branch 5 times, most recently from dc21178 to 8e81820 Compare September 8, 2024 22:27
@tonyjie tonyjie force-pushed the jiajiel/mem_snapshot_pr1 branch 2 times, most recently from 8e81820 to f11590e Compare September 8, 2024 23:15
tonyjie and others added 12 commits September 8, 2024 16:15
…l analysis the memory at the global peak of the trace, and generate CSV files

Signed-off-by: tonyjie <jl4257@cornell.edu>
…activation.

Signed-off-by: tonyjie <jl4257@cornell.edu>
Signed-off-by: tonyjie <jl4257@cornell.edu>
Signed-off-by: tonyjie <jl4257@cornell.edu>
…ing the setup

Signed-off-by: tonyjie <jl4257@cornell.edu>
Signed-off-by: tonyjie <jl4257@cornell.edu>
Signed-off-by: tonyjie <jl4257@cornell.edu>
Signed-off-by: tonyjie <jl4257@cornell.edu>
Signed-off-by: tonyjie <jl4257@cornell.edu>
Signed-off-by: tonyjie <jl4257@cornell.edu>
Signed-off-by: tonyjie <jl4257@cornell.edu>
Signed-off-by: tonyjie <jl4257@cornell.edu>
@tonyjie tonyjie force-pushed the jiajiel/mem_snapshot_pr1 branch from f11590e to 2cb2497 Compare September 8, 2024 23:16
…ch_version; fix other minor issues based on review
@github-actions github-actions bot removed the stale label Sep 9, 2024
@akoumpa akoumpa added Run CICD and removed Run CICD labels Sep 9, 2024
This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

@github-actions github-actions bot added the stale label Sep 24, 2024
github-actions bot commented Oct 1, 2024

This PR was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions bot closed this Oct 1, 2024
@pzelasko (Collaborator) commented:
This PR seems to have slipped through. Should we merge it? @ericharper @titu1994

5 participants