
[Parameters] How to Fix Graph Size Exceeding 2GB in DeepMD-kit Model Compression #4588

xwsci opened this issue Feb 9, 2025 · 2 comments


xwsci commented Feb 9, 2025

Summary

I trained my model using:

dp train input.json

Then, I attempted to freeze and compress the model using:

dp freeze -o graph.pb  
dp compress -i graph.pb -o graph-compress.pb  

However, I encountered the following error:

Traceback (most recent call last):
  File "/opt/deepmd-kit/bin/dp", line 10, in <module>
    sys.exit(main())
  File "/opt/deepmd-kit/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 483, in main
    compress(**dict_args)
  File "/opt/deepmd-kit/lib/python3.10/site-packages/deepmd/entrypoints/compress.py", line 144, in compress
    raise RuntimeError(
RuntimeError: The uniform step size of the tabulation's first table is 0.010000, which is too small. This leads to a very large graph size, exceeding protobuf's limitation (2 GB). You should try to increase the step size.

For the full error message, please see the details below. Could you please guide me on how to solve this issue?

Detailed Description

I trained my model using:

dp train input.json

Then, I attempted to freeze and compress the model using:

dp freeze -o graph.pb  
dp compress -i graph.pb -o graph-compress.pb  

However, I encountered the following error:

DEEPMD INFO    stage 2: freeze the model
Traceback (most recent call last):
  File "/opt/deepmd-kit/lib/python3.10/site-packages/deepmd/entrypoints/freeze.py", line 206, in freeze
    input_graph_def = graph.as_graph_def()
  File "/opt/deepmd-kit/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 3568, in as_graph_def
    result, _ = self._as_graph_def(from_version, add_shapes)
  File "/opt/deepmd-kit/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 3482, in _as_graph_def
    graph.ParseFromString(compat.as_bytes(data))
google.protobuf.message.DecodeError: Error parsing message as the message exceeded the protobuf limit with type 'tensorflow.GraphDef'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/deepmd-kit/lib/python3.10/site-packages/deepmd/entrypoints/compress.py", line 142, in compress
    freeze(checkpoint_folder=checkpoint_folder, output=output, node_names=None)
  File "/opt/deepmd-kit/lib/python3.10/site-packages/deepmd/entrypoints/freeze.py", line 208, in freeze
    raise GraphTooLargeError(
deepmd.utils.errors.GraphTooLargeError: The graph size exceeds 2 GB, the hard limitation of protobuf. Then a DecodeError was raised by protobuf. You should reduce the size of your model.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/deepmd-kit/bin/dp", line 10, in <module>
    sys.exit(main())
  File "/opt/deepmd-kit/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 483, in main
    compress(**dict_args)
  File "/opt/deepmd-kit/lib/python3.10/site-packages/deepmd/entrypoints/compress.py", line 144, in compress
    raise RuntimeError(
RuntimeError: The uniform step size of the tabulation's first table is 0.010000, which is too small. This leads to a very large graph size, exceeding protobuf's limitation (2 GB). You should try to increase the step size.

The input.json file is attached below. Could you please guide me on how to solve this issue?

input (2).json

Further Information, Files, and Links

No response

@wanghan-iapcm (Collaborator) commented

As the error message says: "The uniform step size of the tabulation's first table is 0.010000, which is too small. ... You should try to increase the step size." You can increase it with:

dp compress --step STEP
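
For example, a minimal sketch (the step value 0.05 below is only an illustrative choice, not a value recommended in this thread; larger steps give a smaller frozen graph at the cost of tabulation accuracy):

dp compress -i graph.pb -o graph-compress.pb -s 0.05

The -s/--step option sets the uniform step size of the first tabulation table (its default, 0.01, is the value reported in the error), so after compressing it is worth checking the accuracy of the compressed model, e.g. with dp test.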

@njzjz (Member) commented Feb 10, 2025

You have 7 elements and type_one_side is left at its default value false, which gives 7 × 7 = 49 neural networks in the descriptor. It's better to use the compressible DPA-1 descriptor (attn_layer = 0).
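
For reference, here is a minimal sketch of what the descriptor section of input.json could look like with the DPA-1 (se_atten) descriptor. The cutoffs, sel, and network sizes are illustrative placeholders rather than values taken from the attached input.json, and should be adapted to your system:

"descriptor": {
    "type": "se_atten",
    "sel": 120,
    "rcut": 6.0,
    "rcut_smth": 0.5,
    "neuron": [25, 50, 100],
    "axis_neuron": 16,
    "attn_layer": 0,
    "seed": 1
}

With attn_layer set to 0, the element types enter through the type embedding instead of one embedding network per element pair, so the tabulation (and the frozen graph) no longer grows with the square of the number of elements. Depending on your DeepMD-kit version, compressing an se_atten model may need additional settings (for example stripped type embedding in the v2 series), so please check the documentation of your release.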
