[EPIC]: Replace dfencoder with NVTabular #517

Closed
8 of 10 tasks
Tracked by #1141
BartleyR opened this issue Dec 6, 2022 · 3 comments
Labels
  • dfp [Workflow] Related to the Digital Fingerprinting (DFP) workflow
  • feature request New feature or request
  • tracking Indicates this issue is a tracking issue with a task list referencing other issues

Comments

@BartleyR (Contributor) commented Dec 6, 2022

Is this a new feature, an improvement, or a change to existing functionality?

Improvement

How would you describe the priority of this feature request?

High

Please provide a clear description of the problem this feature solves

Morpheus currently relies on a fork of dfencoder, especially for Digital Fingerprinting. There are a number of issues with this, including:

  • No native support for cuDF (Pandas only)
  • Frequent issues/errors that need to be patched
  • Sub-optimal performance (currently the bottleneck for the workflow)

This should be replaced with something that is more performant and that is GPU-aware.

Describe your ideal solution

The Merlin team created NVTabular, which appears to be a suitable replacement for dfencoder in our pipelines. The steps to get there include:

Tasks

  1. feature request (dagardner-nv)
  2. 3 of 3: Priority 0, feature request (drobison00)
  3. 4 of 4: feature request (drobison00)
  4. 1 of 5: dfp, feature request (drobison00)
  5. 0 of 4: Priority 0, feature request (drobison00)

Describe any alternatives you have considered

No response

Additional context

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
  • I have searched the open feature requests and have found no duplicates for this feature request
@mdemoret-nv (Contributor) commented

The following is a quick breakdown of the process and the necessary steps to update DFEncoder to utilize NVTabular.

Updates to Morpheus core

We should update the morpheus.utils.column_info library to use NVTabular, since the functionality is nearly identical and NVT provides more features that are regularly tested. A sketch of this mapping follows the list below.

  1. Update dependencies to include nvtabular in the conda environment
  2. Replace the classes and functions in morpheus.utils.column_info with equivalents from NVTabular
    1. Map all implementations of ColumnInfo to NVTabular operation equivalents
      1. If any cannot be mapped, create custom operations
    2. Replace all uses of the DataFrameInputSchema class with nvt.Schema
      1. This could potentially also use nvt.Workflow
    3. Replace all uses of process_dataframe with nvt.Workflow.fit_transform()
    4. Add metadata tags for use in specific pipeline implementations (e.g. UserID column tags, date tags, etc.)
    5. Ideally, we would keep this as backwards compatible as possible, preserving the existing class names and public API while replacing the implementation with NVT
  3. Update documentation around new changes to morpheus.utils.column_info
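
As a rough illustration of the mapping referenced above, here is a minimal sketch of how a column_info-style definition could be expressed as an NVTabular Workflow. The column names (username, logcount) and the specific ops are illustrative assumptions, not the final mapping:

```python
# Hypothetical sketch: a DataFrameInputSchema-style definition expressed as an
# NVTabular Workflow. Column names and op choices are illustrative only.
import pandas as pd
import nvtabular as nvt
from nvtabular import ops

# Tiny illustrative frame; real pipelines would typically pass a cuDF DataFrame.
df = pd.DataFrame({"username": ["alice", "bob"], "logcount": [3.0, 7.0]})

# Per-column operations, analogous to ColumnInfo entries in column_info.
cat_features = ["username"] >> ops.Categorify()   # categorical encoding
cont_features = ["logcount"] >> ops.Normalize()   # numeric standardization

workflow = nvt.Workflow(cat_features + cont_features)

# Roughly what a process_dataframe(df, schema) replacement would do:
transformed = workflow.fit_transform(nvt.Dataset(df)).to_ddf().compute()
```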

Updates to dfencoder

Given an input dataframe, dfencoder currently does two things: 1) sets up the dataframe schema based on the column types and values, and 2) builds the structure of the auto encoder model from that schema. The schema determination should be replaced with NVTabular (most likely just the updates to morpheus.utils.column_info), and the model should then be built from the NVT schema.
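
For context, the dtype-driven bucketing dfencoder performs today can be sketched roughly as follows; this helper is a simplified assumption of the current behavior, not the exact implementation:

```python
# Simplified sketch of how dfencoder buckets columns today (not the exact
# implementation): a dtype-driven categorical/numeric/binary split.
import pandas as pd

def bucket_columns(df: pd.DataFrame):
    categorical, numeric, binary = [], [], []
    for col in df.columns:
        if df[col].dtype == bool or df[col].nunique() == 2:
            binary.append(col)       # booleans and two-valued columns
        elif pd.api.types.is_numeric_dtype(df[col]):
            numeric.append(col)      # continuous features
        else:
            categorical.append(col)  # everything else is treated as categorical
    return categorical, numeric, binary
```

Under the NVT approach, this inference would instead come from the dataset schema and its tags.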

  1. Move dfencoder from its own repository into Morpheus
    1. Copy the dfencoder source files into a new submodule: morpheus.models.dfencoder
      1. Make a copy of these files, as-is, called morpheus.models.dfencoder_old
        1. This is to allow side-by-side testing. They will be removed at the end.
    2. Deprecate the existing repository
  2. Update DFEncoder data ingestion and ETL to use NVTabular.
    1. Update dfencoder.AutoEncoder.init_features()
      1. init_features takes a sample dataframe and determines the schema from the DataFrame's column types
      2. Columns are put into 1 of 3 buckets: categorical, numerical and binary
      3. This should be replaced with NVTabular's schema, which also improves on the number of column types available
      4. The ability should be added to dfencoder.AutoEncoder to allow manual overriding of the DataFrame schema (i.e. supplying specific operations or an entire schema, which would bypass init_features altogether)
    2. Update dfencoder.AutoEncoder.build_model()
      1. build_model takes the determined schema from a sample DataFrame and builds the auto encoder model from this schema.
      2. This requires mapping the schema to specific PyTorch functions before concatenating everything together for the core auto encoder layers.
      3. We would need to update the current code to map from the NVTabular schema instead of the current system
      4. We should look at other models in Merlin for examples on how to do this
    3. Update dfencoder.AutoEncoder.prepare_df()
      1. prepare_df takes an input DataFrame and runs it through the preprocessor to get the final input before passing it to PyTorch
      2. This will need to be replaced with a nvt.Workflow that takes the schema determined from the init_features function
      3. We should look at speeding this up with parallelization or sharding across GPUs if needed.
  3. Update DFEncoder training loop to use NVTabular
    1. Update dfencoder.AutoEncoder.fit()
      1. fit currently calls build_model on first use, then prepare_df, before finally running a simple PyTorch training loop over batches. This has a few downsides: 1) while the training loop is batched, the ETL and validation steps are not, and 2) the loop is poorly implemented, which makes extending/improving it difficult.
      2. The loop needs to use data loaders to allow for batched processing of the ETL, training and validation parts to reduce memory consumption.
      3. Reference the NVTabular documentation for running a training loop with a data loader; this should be the foundation of the dfencoder training loop (see the sketch after this list)
  4. Add tests to DFEncoder
    1. Start with sample tests representative of DFP-type workloads (i.e. use DFP sample data) that run using the existing DFEncoder in the separate repo
    2. Run all those same tests using the new implementation, validating the output against the old code
    3. Add new tests for any additional features that were added above and beyond due to NVT
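
As a rough sketch of the data loader step above, a batched loop built on NVTabular's PyTorch loader might look like the following. The dataset path, column lists, and the stand-in model are assumptions, and the exact batch format of TorchAsyncItr can vary between NVTabular versions:

```python
# Sketch: batched ETL + training using NVTabular's PyTorch data loader.
# Path, column lists, and the stand-in model are illustrative assumptions.
import torch
import torch.nn.functional as F
import nvtabular as nvt
from nvtabular.loader.torch import TorchAsyncItr

dataset = nvt.Dataset("train.parquet")  # e.g. output of the nvt.Workflow above
loader = TorchAsyncItr(dataset, batch_size=4096,
                       cats=["username"], conts=["logcount"], labels=[])

# Stand-in for the model that build_model() would derive from the NVT schema.
model = torch.nn.Sequential(torch.nn.Linear(1, 8), torch.nn.ReLU(),
                            torch.nn.Linear(8, 1)).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(3):
    for batch in loader:       # ETL now happens per batch, not all up front
        x_cont = batch[1]      # (cats, conts, ...) tuple; format may vary
        optimizer.zero_grad()
        loss = F.mse_loss(model(x_cont), x_cont)  # reconstruction loss
        loss.backward()
        optimizer.step()
```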

Updates to DFP

Once Morpheus and DFEncoder have been updated to use NVTabular, there will need to be updates to the DFP pipelines to match the changes and take advantage of the latest features.

  1. Update the Azure, Duo and other DFP workflows with the proper schema using NVT
    1. Assuming much of the API is backwards compatible from the changes to morpheus.utils.column_info, the number of changes should be small
    2. Will likely need to add tags and other metadata that we couldn't express before
  2. Update pipeline to use tagged metadata for determining specific columns.
    1. Currently we have many properties such as userid_column_name or datetime_column_name; this information should instead be picked up from the schema metadata (see the sketch after this list)
  3. Switch the DFP training loop to use the new DFEncoder classes inside of Morpheus.
    1. Again, assuming the class is mostly backwards compatible, this should largely be a change to the import.
  4. Validate the DFP pipelines using the new classes
  5. Update DFP documentation with changes from the current implementation
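
To illustrate the tagged-metadata item above, here is a sketch of resolving the user-ID column from schema metadata instead of a userid_column_name property. The use of Tags.USER_ID as the marker and the helper function are assumptions:

```python
# Sketch: resolving the user-ID column from schema tags rather than a
# userid_column_name config property. Tags.USER_ID as the marker is assumed.
from merlin.schema import ColumnSchema, Schema, Tags

schema = Schema([
    ColumnSchema("username", tags=[Tags.CATEGORICAL, Tags.USER_ID]),
    ColumnSchema("logcount", tags=[Tags.CONTINUOUS]),
])

def userid_column(schema: Schema) -> str:
    # Hypothetical replacement for properties like userid_column_name.
    matches = schema.select_by_tag(Tags.USER_ID).column_names
    if len(matches) != 1:
        raise ValueError(f"expected exactly one USER_ID column, found {matches}")
    return matches[0]

print(userid_column(schema))  # -> "username"
```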

Additional Features

Multi-GPU

Once a baseline version of the DFEncoder model has been created and validated, we could make a new class MultiGpuAutoEncoder, which derives from AutoEncoder but runs the training loop assuming multiple GPUs are available (one possible shape is sketched after the list below).

  1. Create MultiGpuAutoEncoder deriving from AutoEncoder
  2. Override the base functions as necessary to perform the training across multiple GPUs.
    1. This may involve overriding one or more functions depending on how multi-GPU training works in PyTorch
    2. Additional functionality may need to be pulled out of the base training loop so it can be used by the derived class as well
  3. Add multi-GPU training tests
    1. TBD on how this would be run in CI
  4. Update documentation with examples on how to train using multi-GPU
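
One possible shape for this, sketched under the assumption that the refactored AutoEncoder exposes a build_model() override point (hypothetical) and that training is launched with one process per GPU:

```python
# Sketch: MultiGpuAutoEncoder via PyTorch DistributedDataParallel (DDP).
# The AutoEncoder base class and its build_model() hook are assumed here.
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

from morpheus.models.dfencoder import AutoEncoder  # the new submodule above

class MultiGpuAutoEncoder(AutoEncoder):
    """Runs training with one process per GPU (e.g. launched via torchrun)."""

    def build_model(self, schema):
        super().build_model(schema)
        rank = dist.get_rank()
        # Replicate the model on this rank's GPU; DDP synchronizes gradients.
        self.model = DDP(self.model.to(rank), device_ids=[rank])
```

The process group would need to be initialized (e.g. dist.init_process_group("nccl")) before fit() runs, and the data loader would need per-rank sharding, which NVTabular's loaders expose via global_size/global_rank parameters.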

WAG estimate on LOE: 2-3 weeks.

@jarmak-nv jarmak-nv changed the title [FEA]: Replace dfencoder with NVTabular [EPIC]: Replace dfencoder with NVTabular Apr 6, 2023
@mdemoret-nv (Contributor) commented

Moving tracking issue to next release

@mdemoret-nv mdemoret-nv added the dfp [Workflow] Related to the Digital Fingerprinting (DFP) workflow label Aug 21, 2023
@mdemoret-nv mdemoret-nv added this to the 23.11 - DFP Improvements milestone Aug 21, 2023
@mdemoret-nv mdemoret-nv removed this from the 23.11 - DFP Improvements milestone Dec 7, 2023
@mdemoret-nv (Contributor) commented

Deprioritizing

@mdemoret-nv mdemoret-nv added the tracking Indicates this issue is a tracking issue with a task list referencing other issues label Dec 13, 2023