[EPIC]: Replace dfencoder with NVTabular #517

Closed
8 of 10 tasks
Tracked by #1141
BartleyR opened this issue Dec 6, 2022 · 3 comments
Labels
  • dfp [Workflow] Related to the Digital Fingerprinting (DFP) workflow
  • feature request New feature or request
  • tracking Indicates this issue is a tracking issue with a task list referencing other issues

Comments

@BartleyR (Contributor) commented Dec 6, 2022

Is this a new feature, an improvement, or a change to existing functionality?

Improvement

How would you describe the priority of this feature request?

High

Please provide a clear description of the problem this feature solves

Morpheus currently relies on a fork of dfencoder, especially for Digital Fingerprinting. There are a number of issues with this, including:

  • No native support for cuDF (Pandas only)
  • Frequent issues/errors that need to be patched
  • Sub-optimal performance (currently the bottleneck for the workflow)

This should be replaced with something that is more performant and that is GPU-aware.

Describe your ideal solution

The Merlin team created NVTabular, which appears to be a suitable replacement for dfencoder in our pipelines. The steps to get there include:

Tasks

  1. feature request (dagardner-nv)
  2. 3 of 3: Priority 0, feature request (drobison00)
  3. 4 of 4: feature request (drobison00)
  4. 1 of 5: dfp, feature request (drobison00)
  5. 0 of 4: Priority 0, feature request (drobison00)

Describe any alternatives you have considered

No response

Additional context

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
  • I have searched the open feature requests and have found no duplicates for this feature request
@mdemoret-nv (Contributor) commented

The following is a quick breakdown of the process and the necessary steps to update DFEncoder to utilize NVTabular.

Updates to Morpheus core

We should update the morpheus.utils.column_info library to use NVTabular, since the functionality is nearly identical and NVT provides more features that are regularly tested. A sketch of this mapping follows the list below.

  1. Update dependencies to include nvtabular in the conda environment
  2. Replace the classes and functions in morpheus.utils.column_info with equivalents from NVTabular
    1. Map all implementations of ColumnInfo to NVTabular operation equivalents
      1. If any cannot be mapped, create custom operations
    2. Replace all uses of the DataFrameInputSchema class with nvt.Schema
      1. This could potentially also use nvt.Workflow
    3. Replace all uses of process_dataframe with nvt.Workflow.fit_transform()
    4. Add metadata tags for use in specific pipeline implementations (e.g. UserID column tags, date tags, etc.)
    5. Ideally, we would keep this as backwards compatible as possible, preserving the existing class names and public API while replacing the implementation with NVT
  3. Update documentation around new changes to morpheus.utils.column_info
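
As a rough illustration of the mapping referenced above, here is a minimal sketch of how a column_info-style definition could be expressed as an NVTabular Workflow. The column names (username, logcount) and the specific ops are illustrative assumptions, not the final mapping:

```python
# Hypothetical sketch: a DataFrameInputSchema-style definition expressed as an
# NVTabular Workflow. Column names and op choices are illustrative only.
import pandas as pd
import nvtabular as nvt
from nvtabular import ops

# Tiny illustrative frame; real pipelines would typically pass a cuDF DataFrame.
df = pd.DataFrame({"username": ["alice", "bob"], "logcount": [3.0, 7.0]})

# Per-column operations, analogous to ColumnInfo entries in column_info.
cat_features = ["username"] >> ops.Categorify()   # categorical encoding
cont_features = ["logcount"] >> ops.Normalize()   # numeric standardization

workflow = nvt.Workflow(cat_features + cont_features)

# Roughly what a process_dataframe(df, schema) replacement would do:
transformed = workflow.fit_transform(nvt.Dataset(df)).to_ddf().compute()
```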

Updates to dfencoder

Given an input dataframe, dfencoder currently does two things: 1) sets up the dataframe schema based on the column types and values, and 2) builds the structure of the auto encoder model from that schema. The schema determination should be replaced with NVTabular (most likely just the updates to morpheus.utils.column_info), and the model should then be built from the NVT schema.
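
For context, the dtype-driven bucketing dfencoder performs today can be sketched roughly as follows; this helper is a simplified assumption of the current behavior, not the exact implementation:

```python
# Simplified sketch of how dfencoder buckets columns today (not the exact
# implementation): a dtype-driven categorical/numeric/binary split.
import pandas as pd

def bucket_columns(df: pd.DataFrame):
    categorical, numeric, binary = [], [], []
    for col in df.columns:
        if df[col].dtype == bool or df[col].nunique() == 2:
            binary.append(col)       # booleans and two-valued columns
        elif pd.api.types.is_numeric_dtype(df[col]):
            numeric.append(col)      # continuous features
        else:
            categorical.append(col)  # everything else is treated as categorical
    return categorical, numeric, binary
```

Under the NVT approach, this inference would instead come from the dataset schema and its tags.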

  1. Move dfencoder from its own repository into Morpheus
    1. Copy the dfencoder source files into a new submodule: morpheus.models.dfencoder
      1. Make a copy of these files, as-is, called morpheus.models.dfencoder_old
        1. This is to allow side-by-side testing. They will be removed at the end.
    2. Deprecate the existing repository
  2. Update DFEncoder data ingestion and ETL to use NVTabular.
    1. Update dfencoder.AutoEncoder.init_features()
      1. init_features takes a sample dataframe and determines the schema from the DataFrame's column types
      2. Columns are put into 1 of 3 buckets: categorical, numerical and binary
      3. This should be replaced with NVTabular's schema, which also improves on the number of column types available
      4. The ability should be added to dfencoder.AutoEncoder to allow manual overriding of the DataFrame schema (i.e. supplying specific operations or an entire schema, which would bypass init_features altogether)
    2. Update dfencoder.AutoEncoder.build_model()
      1. build_model takes the determined schema from a sample DataFrame and builds the auto encoder model from this schema.
      2. This requires mapping the schema to specific PyTorch functions before concatenating everything together for the core auto encoder layers.
      3. We would need to update the current code to map from the NVTabular schema instead of the current system
      4. We should look at other models in Merlin for examples on how to do this
    3. Update dfencoder.AutoEncoder.prepare_df()
      1. prepare_df takes an input DataFrame and runs it through the preprocessor to get the final input before passing it to PyTorch
      2. This will need to be replaced with a nvt.Workflow that takes the schema determined from the init_features function
      3. We should look at speeding this up with parallelization or sharding across GPUs if needed.
  3. Update DFEncoder training loop to use NVTabular
    1. Update dfencoder.AutoEncoder.fit()
      1. fit currently calls build_model on first use, then prepare_df, before finally running a simple PyTorch training loop over batches. This has a few downsides: 1) while the training loop is batched, the ETL and validation steps are not, and 2) the loop is poorly implemented, which makes extending/improving it difficult.
      2. The loop needs to use data loaders to allow for batched processing of the ETL, training and validation parts to reduce memory consumption.
      3. Reference the NVTabular documentation for running a training loop with a data loader; this should be the foundation of the dfencoder training loop (see the sketch after this list)
  4. Add tests to DFEncoder
    1. Start with sample tests representative of DFP-type workloads (i.e. use DFP sample data) that run using the existing DFEncoder in the separate repo
    2. Run all those same tests using the new implementation, validating the output against the old code
    3. Add new tests for any additional features that were added above and beyond due to NVT
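
As a rough sketch of the data loader step above, a batched loop built on NVTabular's PyTorch loader might look like the following. The dataset path, column lists, and the stand-in model are assumptions, and the exact batch format of TorchAsyncItr can vary between NVTabular versions:

```python
# Sketch: batched ETL + training using NVTabular's PyTorch data loader.
# Path, column lists, and the stand-in model are illustrative assumptions.
import torch
import torch.nn.functional as F
import nvtabular as nvt
from nvtabular.loader.torch import TorchAsyncItr

dataset = nvt.Dataset("train.parquet")  # e.g. output of the nvt.Workflow above
loader = TorchAsyncItr(dataset, batch_size=4096,
                       cats=["username"], conts=["logcount"], labels=[])

# Stand-in for the model that build_model() would derive from the NVT schema.
model = torch.nn.Sequential(torch.nn.Linear(1, 8), torch.nn.ReLU(),
                            torch.nn.Linear(8, 1)).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(3):
    for batch in loader:       # ETL now happens per batch, not all up front
        x_cont = batch[1]      # (cats, conts, ...) tuple; format may vary
        optimizer.zero_grad()
        loss = F.mse_loss(model(x_cont), x_cont)  # reconstruction loss
        loss.backward()
        optimizer.step()
```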

Updates to DFP

Once Morpheus and DFEncoder have been updated to use NVTabular, there will need to be updates to the DFP pipelines to match the changes and take advantage of the latest features.

  1. Update the Azure, Duo and other DFP workflows with the proper schema using NVT
    1. Assuming much of the API is backwards compatible from the changes to morpheus.utils.column_info, the number of changes should be small
    2. Will likely need to add tags and other metadata that we couldn't express before
  2. Update pipeline to use tagged metadata for determining specific columns.
    1. Currently we have many properties such as userid_column_name or datetime_column_name; this information should instead be picked up from the schema metadata (see the sketch after this list)
  3. Switch the DFP training loop to use the new DFEncoder classes inside of Morpheus.
    1. Again, assuming the class is mostly backwards compatible, this should largely be a change to the import.
  4. Validate the DFP pipelines using the new classes
  5. Update DFP documentation with changes from the current implementation
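
To illustrate the tagged-metadata item above, here is a sketch of resolving the user-ID column from schema metadata instead of a userid_column_name property. The use of Tags.USER_ID as the marker and the helper function are assumptions:

```python
# Sketch: resolving the user-ID column from schema tags rather than a
# userid_column_name config property. Tags.USER_ID as the marker is assumed.
from merlin.schema import ColumnSchema, Schema, Tags

schema = Schema([
    ColumnSchema("username", tags=[Tags.CATEGORICAL, Tags.USER_ID]),
    ColumnSchema("logcount", tags=[Tags.CONTINUOUS]),
])

def userid_column(schema: Schema) -> str:
    # Hypothetical replacement for properties like userid_column_name.
    matches = schema.select_by_tag(Tags.USER_ID).column_names
    if len(matches) != 1:
        raise ValueError(f"expected exactly one USER_ID column, found {matches}")
    return matches[0]

print(userid_column(schema))  # -> "username"
```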

Additional Features

Multi-GPU

Once a baseline version of the DFEncoder model has been created and validated, we could make a new class MultiGpuAutoEncoder, which derives from AutoEncoder but runs the training loop assuming multiple GPUs are available (one possible shape is sketched after the list below).

  1. Create MultiGpuAutoEncoder deriving from AutoEncoder
  2. Override the base functions as necessary to perform the training across multiple GPUs.
    1. This may involve overriding one or more functions depending on how multi-GPU training works in PyTorch
    2. Additional functionality may need to be pulled out of the base training loop so it can be used by the derived class as well
  3. Add multi-GPU training tests
    1. TBD on how this would be run in CI
  4. Update documentation with examples on how to train using multi-GPU
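
One possible shape for this, sketched under the assumption that the refactored AutoEncoder exposes a build_model() override point (hypothetical) and that training is launched with one process per GPU:

```python
# Sketch: MultiGpuAutoEncoder via PyTorch DistributedDataParallel (DDP).
# The AutoEncoder base class and its build_model() hook are assumed here.
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

from morpheus.models.dfencoder import AutoEncoder  # the new submodule above

class MultiGpuAutoEncoder(AutoEncoder):
    """Runs training with one process per GPU (e.g. launched via torchrun)."""

    def build_model(self, schema):
        super().build_model(schema)
        rank = dist.get_rank()
        # Replicate the model on this rank's GPU; DDP synchronizes gradients.
        self.model = DDP(self.model.to(rank), device_ids=[rank])
```

The process group would need to be initialized (e.g. dist.init_process_group("nccl")) before fit() runs, and the data loader would need per-rank sharding, which NVTabular's loaders expose via global_size/global_rank parameters.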

WAG estimate on LOE: 2-3 weeks.

@jarmak-nv jarmak-nv changed the title [FEA]: Replace dfencoder with NVTabular [EPIC]: Replace dfencoder with NVTabular Apr 6, 2023
@mdemoret-nv (Contributor) commented

Moving tracking issue to next release

@mdemoret-nv mdemoret-nv added the dfp [Workflow] Related to the Digital Fingerprinting (DFP) workflow label Aug 21, 2023
@mdemoret-nv mdemoret-nv added this to the 23.11 - DFP Improvements milestone Aug 21, 2023
@mdemoret-nv mdemoret-nv removed this from the 23.11 - DFP Improvements milestone Dec 7, 2023
@mdemoret-nv (Contributor) commented

Deprioritizing

@mdemoret-nv mdemoret-nv added the tracking Indicates this issue is a tracking issue with a task list referencing other issues label Dec 13, 2023