set min seq len by default #621

pstjohn · 2025-01-17T20:44:39Z

Description

https://nvbugspro.nvidia.com/bug/5060664 reports a performance warning when pretraining with variable sequence lengths. This was largely an oversight: our test scripts didn't set both the minimum and maximum sequence lengths. When min_seq_length is omitted, the default should be to pad to the maximum sequence length, for performance reasons.

Type of changes

- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Refactor
- [ ] Documentation update
- [ ] Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels:

SKIP_CI - Skip all continuous integration tests
INCLUDE_NOTEBOOKS_TESTS - Execute notebook validation tests in pytest

Note

By default, the notebooks validation tests are skipped unless explicitly enabled.

Usage

TODO: Add code snippet
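
A minimal sketch of the default described above, not the code from this PR: when `min_seq_length` is omitted, it falls back to `max_seq_length`, so every batch is padded to one fixed length. The helper names `resolve_min_seq_length` and `pad_batch`, and the `pad_id` parameter, are hypothetical and are not part of the BioNeMo API.

```python
from typing import List, Optional


def resolve_min_seq_length(min_seq_length: Optional[int], max_seq_length: int) -> int:
    """Return the effective minimum length (hypothetical helper).

    Assumption: an omitted min_seq_length means "pad to max_seq_length".
    """
    if min_seq_length is None:
        return max_seq_length
    if min_seq_length > max_seq_length:
        raise ValueError("min_seq_length must not exceed max_seq_length")
    return min_seq_length


def pad_batch(
    batch: List[List[int]],
    min_seq_length: Optional[int],
    max_seq_length: int,
    pad_id: int = 0,
) -> List[List[int]]:
    """Truncate to max_seq_length, then pad every sequence to a common length."""
    if not batch:
        return []
    effective_min = resolve_min_seq_length(min_seq_length, max_seq_length)
    longest = max(len(seq) for seq in batch)
    # Target length is at least the effective minimum and at most the maximum.
    target = min(max(effective_min, longest), max_seq_length)
    padded = []
    for seq in batch:
        clipped = seq[:target]
        padded.append(clipped + [pad_id] * (target - len(clipped)))
    return padded


# With min_seq_length omitted, all batches share the same (maximum) length.
fixed = pad_batch([[1, 2, 3], [4]], None, 8)
# With an explicit smaller minimum, the batch length varies with its contents.
variable = pad_batch([[1, 2, 3], [4]], 2, 8)
```

In this sketch, only an explicit `min_seq_length` smaller than `max_seq_length` produces variable batch shapes, which is the case the performance warning concerns.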

Pre-submit Checklist

- [x] I have tested these changes locally
- [x] I have updated the documentation accordingly
- [x] I have added/updated tests as needed
- [x] All existing tests pass successfully

Signed-off-by: Peter St. John <pstjohn@nvidia.com>

sichu2023

TY!

farhadrgh

Can you add this change to finetune Datamodule here:
https://github.com/NVIDIA/bionemo-framework/blob/main/sub-packages/bionemo-esm2/src/bionemo/esm2/model/finetune/datamodule.py

codecov-commenter · 2025-01-17T22:05:10Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.62%. Comparing base (7f9dd97) to head (addc0ad).

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #621   +/-   ##
=======================================
  Coverage   86.62%   86.62%           
=======================================
  Files         116      116           
  Lines        6961     6961           
=======================================
  Hits         6030     6030           
  Misses        931      931


### Description

In https://nvbugspro.nvidia.com/bug/5060664 they notice a warning message about performance when pretraining with variable sequence lengths. This is largely an oversight since our test scripts didn't set both minimum and maximum seq_lens. We should have the default if min_seq_length is omitted be to just pad to the maximum sequence length for performance reasons.

### Type of changes

- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Refactor
- [ ] Documentation update
- [ ] Other (please describe):

### CI Pipeline Configuration

Configure CI behavior by applying the relevant labels:

- [SKIP_CI](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#skip_ci) - Skip all continuous integration tests
- [INCLUDE_NOTEBOOKS_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_notebooks_tests) - Execute notebook validation tests in pytest

> [!NOTE]
> By default, the notebooks validation tests are skipped unless explicitly enabled.

### Usage

```python
TODO: Add code snippet
```

### Pre-submit Checklist

- [x] I have tested these changes locally
- [x] I have updated the documentation accordingly
- [x] I have added/updated tests as needed
- [x] All existing tests pass successfully

Signed-off-by: Peter St. John <pstjohn@nvidia.com>
Signed-off-by: Polina Binder <pbinder@nvidia.com>

set min seq len by default

addc0ad

Signed-off-by: Peter St. John <pstjohn@nvidia.com>

pstjohn requested review from jstjohn and trvachov as code owners January 17, 2025 20:44

pstjohn requested review from sichu2023 and farhadrgh January 17, 2025 20:44

sichu2023 approved these changes Jan 17, 2025

farhadrgh approved these changes Jan 17, 2025

pstjohn enabled auto-merge January 17, 2025 20:58

pstjohn disabled auto-merge January 17, 2025 20:58

jstjohn approved these changes Jan 18, 2025

pstjohn added this pull request to the merge queue Jan 18, 2025

Merged via the queue into NVIDIA:main with commit 0c990a7 Jan 18, 2025
7 of 15 checks passed

pstjohn deleted the pstjohn/set-min-seq-len branch January 18, 2025 02:36
