
ERROR - tf2onnx.tfonnx: Failed to convert node 'StatefulPartitionedCall/functional_1/layer_normalization/FusedBatchNormV3' #1175

Closed
HarryAA opened this issue Nov 11, 2020 · 19 comments · Fixed by #1249


@HarryAA

HarryAA commented Nov 11, 2020

Describe the bug
When trying to convert a TensorFlow model containing a tf.keras.layers.LayerNormalization layer, the conversion to ONNX fails. This occurs at the FusedBatchNormV3 node, when tf2onnx attempts to resize the mean input to the same shape as the scale input, producing a ValueError: negative dimensions are not allowed.

Urgency
None.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Running in Colab
  • Tensorflow Version: 2.3.0
  • Python version: 3.6.9

To Reproduce
Describe steps/code to reproduce the behavior:
Create a model containing a tf.keras.layers.LayerNormalization layer, such as:

import tensorflow as tf

class DenseBlock(tf.keras.Model):
  def __init__(self, input_size, depth=5, in_channels=64):
    super(DenseBlock, self).__init__(name='')
    self.depth = depth
    self.in_channels = in_channels
    self.pad = tf.keras.layers.ZeroPadding2D(((1, 0), (1, 1)))
    self.twidth = 2
    self.kernel_size = (self.twidth, 3)
    for i in range(self.depth):
      dil = 2**i
      pad_length = self.twidth + (dil-1)*(self.twidth-1)-1
      setattr(self, 'pad{}'.format(i+1), tf.keras.layers.ZeroPadding2D(((pad_length, 0), (1, 1))))
      setattr(self, 'conv{}'.format(i+1), tf.keras.layers.Conv2D(filters=self.in_channels, kernel_size=self.kernel_size, dilation_rate=(dil, 1)))
      setattr(self, 'norm{}'.format(i+1), tf.keras.layers.LayerNormalization(axis=-1))
      setattr(self, 'prelu{}'.format(i+1), tf.keras.layers.PReLU(shared_axes=[1, 2]))

  def call(self, input_tensor):
    skip = input_tensor
    for i in range(self.depth):
      x = getattr(self, 'pad{}'.format(i+1))(skip)
      x = getattr(self, 'conv{}'.format(i+1))(x)
      x = getattr(self, 'norm{}'.format(i+1))(x)
      x = getattr(self, 'prelu{}'.format(i+1))(x)
      skip = tf.concat((x, skip), axis=3)
    return x

input = tf.keras.layers.Input(shape=(None, 30, 1))
x = DenseBlock(30, 5, 64)(input)
output = tf.keras.layers.Dense(units=1)(x)
model = tf.keras.Model(inputs=input, outputs=output)
model.compile()
model.summary()

Then save the model and attempt to convert it with tf2onnx.convert; the conversion will fail.
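The exact commands aren't shown in the report; a typical flow would be something like the following (paths are illustrative):

model.save('dense_block_model')

and then, from the shell:

python -m tf2onnx.convert --saved-model dense_block_model --output dense_block.onnx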

Screenshots

OP=BatchNormalization
Name=StatefulPartitionedCall/functional_1/layer_normalization/FusedBatchNormV3
Inputs:
    StatefulPartitionedCall/functional_1/layer_normalization/Reshape:0=Reshape, [1, -1, -1, 1], 1
    StatefulPartitionedCall/functional_1/layer_normalization/Fill:0=Expand, [-1], 1
    StatefulPartitionedCall/functional_1/layer_normalization/Fill_1:0=Expand, [-1], 1
    StatefulPartitionedCall/functional_1/layer_normalization/Const_2:0=Const, [0], 1
    StatefulPartitionedCall/functional_1/layer_normalization/Const_3:0=Const, [0], 1
Outpus:
    StatefulPartitionedCall/functional_1/layer_normalization/FusedBatchNormV3:0=[1, -1, -1, 1], 1

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tf2onnx/tfonnx.py", line 287, in tensorflow_onnx_mapping
    func(g, node, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tf2onnx/onnx_opset/nn.py", line 796, in version_9
    cls.version_6(ctx, node, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tf2onnx/onnx_opset/nn.py", line 780, in version_6
    new_mean_value = np.array(np.resize(node.inputs[3].get_tensor_value(as_list=False), scale_shape),
  File "<__array_function__ internals>", line 6, in resize
  File "/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py", line 1406, in resize
    return mu.zeros(new_shape, a.dtype)
ValueError: negative dimensions are not allowed
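For context, my reading of the traceback (not stated explicitly in the report): in training mode the node's mean input is an empty tensor, and scale_shape contains a -1 from an unknown dimension, so np.resize falls through to np.zeros(new_shape), which rejects negative sizes. A minimal reproduction of just the numpy failure:

import numpy as np

empty_mean = np.array([], dtype=np.float32)  # the mean input is empty in training mode
scale_shape = (-1,)                          # -1 comes from an unknown/dynamic dimension
np.resize(empty_mean, scale_shape)           # ValueError: negative dimensions are not allowed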

[screenshot: TensorBoard view of the FusedBatchNormV3 node]

Additional context
I have included a screenshot of TensorBoard showing the node in question.

@HarryAA
Author

HarryAA commented Nov 11, 2020

Looking at the TensorFlow source code, the layer does the following on lines 1217-1265 of normalization.py:

      # Collapse dims before self.axis, and dims in self.axis
      pre_dim, in_dim = (1, 1)
      axis = sorted(self.axis)
      tensor_shape = array_ops.shape(inputs)
      for dim in range(0, ndims):
        dim_tensor = tensor_shape[dim]
        if dim < axis[0]:
          pre_dim = pre_dim * dim_tensor
        else:
          assert dim in axis
          in_dim = in_dim * dim_tensor

      squeezed_shape = [1, pre_dim, in_dim, 1]
      # This fused operation requires reshaped inputs to be NCHW.
      data_format = 'NCHW'

      inputs = array_ops.reshape(inputs, squeezed_shape)

      def _set_const_tensor(val, dtype, shape):
        return array_ops.fill(shape, constant_op.constant(val, dtype=dtype))

      # self.gamma and self.beta have the wrong shape for fused_batch_norm, so
      # we cannot pass them as the scale and offset parameters. Therefore, we
      # create two constant tensors in correct shapes for fused_batch_norm and
      # later construct a separate calculation on the scale and offset.
      scale = _set_const_tensor(1.0, self.dtype, [pre_dim])
      offset = _set_const_tensor(0.0, self.dtype, [pre_dim])

      # Compute layer normalization using the fused_batch_norm function.
      outputs, _, _ = nn.fused_batch_norm(
          inputs,
          scale=scale,
          offset=offset,
          epsilon=self.epsilon,
          data_format=data_format)

      outputs = array_ops.reshape(outputs, tensor_shape)

      scale, offset = _broadcast(self.gamma), _broadcast(self.beta)

      if scale is not None:
        outputs = outputs * math_ops.cast(scale, outputs.dtype)
      if offset is not None:
        outputs = outputs + math_ops.cast(offset, outputs.dtype)

    # If some components of the shape got lost due to adjustments, fix that.
    outputs.set_shape(input_shape)

    return outputs

The comment suggests that two placeholder tensors are created with the correct shape to be able to call nn.fused_batch_norm. Note that the conversion works using the non-fused batch normalisation inside the LayerNormalization layer. So is this a feature that is just not supported in the current version of tf2onnx?

@TomWildenhain-Microsoft
Contributor

tf2onnx only supports converting models for inference, not training. I think the above TF behavior is only needed during training, since the scale/offset are constant for inference models.

@HarryAA
Author

HarryAA commented Nov 12, 2020

Hi @TomWildenhain-Microsoft, thanks for the reply. If that is indeed the case, is it possible for tf2onnx to skip these layers? I only want the ONNX model for inference, but the saved model obviously still contains these layers from training.

@TomWildenhain-Microsoft
Contributor

This is my current understanding, but @guschmue correct me if any of this is wrong:

BatchNorm has a property "is_training" which can be true or false. When true, the mean and variance values are computed dynamically during training. When false, they are frozen and stored in the op. If is_training is true when we try to convert the model, the mean and variance values are empty so we don't have enough information to run inference. You can pass the --output_frozen_graph flag to the converter (with a path ending in .pb) to see the inference graph we are converting. In this case, the batchnorm ops have is_training set to true and no mean/variance values:

[screenshot: frozen graph showing a FusedBatchNormV3 op with is_training=true and empty mean/variance]

To convert a keras model with batch normalization, you must set the layer to trainable=false before saving the model. I've been trying to do this with the model you provided (roughly as sketched below) but keep getting trainable=true despite using set_learning_phase(0) before saving. Might be related to this issue:

keras-team/keras#4762

@guschmue do you know how to set is_training to false on a batch norm layer in Keras?
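For reference, the freeze-and-resave attempt described above is roughly the following (a sketch; as noted, the resaved model still came out with training set to true):

import tensorflow as tf

tf.keras.backend.set_learning_phase(0)  # request inference mode before saving
for layer in model.layers:              # nested submodels may need this applied recursively
    layer.trainable = False             # freeze the normalization layers
model.save('frozen_model')              # resave; ideally is_training would now be false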

@HarryAA
Author

HarryAA commented Nov 13, 2020

I also observe a similar problem when I set the LayerNormalization layers to trainable=False at training time (i.e. I just use fixed normalisation parameters for these layers): when I load the saved model back up, the layers appear to have training=True, as all parameters are learnable again. I don't know whether this is intended behaviour or not.

However, it is worth noting that this problem doesn't occur at all when the normal (non-fused) batch normalisation is used inside the LayerNormalization layer. This can be achieved by setting epsilon below 1.001e-5, according to the comments in the TensorFlow source code, normalization.py lines 1120-1125:

# fused_batch_norm will silently raise epsilon to be at least 1.001e-5, so
# we cannot used the fused version if epsilon is below that value. Also, the
# variable dtype must be float32, as fused_batch_norm only supports float32
# variables.
if self.epsilon < 1.001e-5 or self.dtype != 'float32':
  can_use_fused = False

The layer then calls batch_normalization() instead of fused_batch_norm(), and the model is converted successfully with tf2onnx.convert.
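Concretely, the workaround amounts to constructing the layer with an epsilon below the fused threshold (a sketch based on the check quoted above):

# epsilon below 1.001e-5 forces can_use_fused = False, so the layer takes the
# non-fused batch_normalization() path that tf2onnx converts successfully
norm = tf.keras.layers.LayerNormalization(axis=-1, epsilon=1e-6)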

@TomWildenhain-Microsoft
Contributor

@HarryAA thanks for looking into this. I've recently improved our messaging for when training is set to true (tf2onnx will display a warning but continue conversion). Ideally models should have training set to false when we convert them, but it seems like there's a bug in TF/keras that isn't setting that correctly on this op. Can you try converting with the latest tf2onnx from master and see if it fixes your issue?

pip uninstall tf2onnx
pip install git+https://github.com/onnx/tensorflow-onnx

@HarryAA
Author

HarryAA commented Nov 26, 2020

Thanks for the update @TomWildenhain-Microsoft, sorry for the delay. I have tried the latest version of tf2onnx as you suggested and it does indeed convert the model successfully. However, when I load the ONNX model into an onnxruntime inference session I get the following error:

onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Non-zero status code returned while running BatchNormalization node. Name:'StatefulPartitionedCall/functional_1/dense_block/layer_normalization/FusedBatchNormV3' Status Message: Invalid input mean: 0th dimension != 37120

Which suggests to me that this fix may just avoid the underlying issue instead of fixing it?

@TomWildenhain-Microsoft
Contributor

Which suggests to me that this fix may just avoid the underlying issue instead of fixing it?

@HarryAA Yes, I think you are right about that. My current conclusion is that there is a bug in Keras/TF that prevents it from properly setting the training value to false. If the training value is true, tf2onnx has an error. The best solution would be to work around the keras bug to get training to be false. If training is true, I'm not sure there is much we can do to convert the model. We can't just skip the layer, since I think the values produced would be incorrect. Does that seem right?

@wangqiaoshi
Contributor

I also encountered the same problem, with the SequenceToSequence model from OpenNMT-tf.

@HarryAA
Author

HarryAA commented Dec 10, 2020

Does that seem right?

@TomWildenhain-Microsoft Yes, I think it does. The layers definitely need to be included in the conversion because the learnt parameters are used at inference. I have a workaround that avoids the FusedBatchNorm, and this works for me, but it certainly isn't a general solution. I am not sure, though, that even if the bug in TensorFlow is fixed and we are able to set training=False, this problem will actually go away.

@TomWildenhain-Microsoft
Contributor

I am not sure, though, that even if the bug in TensorFlow is fixed and we are able to set training=False, this problem will actually go away.

I'm pretty sure it will. Setting training to false also means TF will include additional data in the op that we need (mean and variance). When training = true, these values are left blank.

https://www.tensorflow.org/api_docs/python/tf/raw_ops/FusedBatchNormV3

@HarryAA
Author

HarryAA commented Dec 18, 2020

Okay, that makes sense. So this is an issue that should be raised in the TF repo. Has this been done? Until it is, this issue can't be fixed.

@TomWildenhain-Microsoft
Contributor

I believe I have seen this issue raised before, but I can't find it... it would be worth raising again. My only other question is how TF is able to run inference at all without these values. If it computes the mean/variance from the current sample, we may be able to do the same in this case. If it is using some sort of rolling average based on previous inferences, then there really is nothing we can do, since ONNX inference is stateless. As a workaround, I think I saw someone say you can "hack" the training value to false by accessing some private properties. Not sure where I saw that... In any case, any further research you do in this area would be appreciated.

@HarryAA
Author

HarryAA commented Dec 18, 2020

Okay sure, I will take a closer look at what TF is doing here when performing inference. I'll try to avoid hacky workarounds for the time being and see if I can fix the underlying issue. Cheers.

@TomWildenhain-Microsoft
Contributor

I took a deep look at this issue today and here's what I found:

  1. The TF spec claims FusedBatchNormV3 will have empty mean/variance exactly when is_training is true. This is incorrect. Rather, mean/variance are empty if and only if training is true AND exponential_avg_factor is 1. In this case, the mean/variance are entirely determined by the values of the input data and all historical values are irrelevant.
  2. TF sometimes uses a FusedBatchNormV3 op in training mode to emulate the behavior of a different op (LayerNormalization) in test mode. Layer normalization is supposed to get the mean/variance from the inputs in test mode, so this is fine. In these cases, the mean/variance are empty and exponential_avg_factor is 1.
  3. The ONNX spec always uses the provided mean/variance for its BatchNormalization op when in test mode. It does not compute them from the input.

The solution is to insert ops for computing the mean/variance when a FusedBatchNormV3 is found to be in training mode and missing those values. I've done this in #1249, which will hopefully fix your issue. On the example model you provided, it produces the correct answer.
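Conceptually, the inserted ops recover the statistics from the input itself, something like the following (a numpy sketch of the idea, not the actual code in #1249):

import numpy as np

def batchnorm_with_computed_stats(x, scale, offset, eps):
    # x is NCHW; in training mode with exponential_avg_factor == 1,
    # FusedBatchNormV3 computes mean/variance per channel over N, H, W
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_norm = (x - mean) / np.sqrt(var + eps)
    return x_norm * scale.reshape(1, -1, 1, 1) + offset.reshape(1, -1, 1, 1)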

@HarryAA
Author

HarryAA commented Jan 4, 2021

Thanks very much for fixing @TomWildenhain-Microsoft !

@romil611

Hi @TomWildenhain-Microsoft, I was using tf2onnx 1.8.1 to generate the ONNX model for the AutoML EfficientDet saved model. I'm still getting a lot of warnings regarding FusedBatchNormV3:

WARNING:tf2onnx.onnx_opset.nn:Node box_net/box-0-bn-6/FusedBatchNormV3 of type FusedBatchNormV3 has is_training set to true, which is not supperted. Please re-save the model with training set to false.

Do you have any suggestions on how I can change the is_training parameter for that layer?

@TomWildenhain-Microsoft
Contributor

@romil611 have you been able to test whether the ONNX model produces the correct results? If so, it's fine to ignore the warnings. Otherwise, please upload a zipped copy of the saved model and the ONNX file you are getting. You might be able to set training to false to fix it, but I've had difficulty with that in the past.
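A quick way to check the ONNX side (a sketch; the model path, input shape, and tolerance are illustrative):

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession('model.onnx')
inp = sess.get_inputs()[0]
x = np.random.rand(1, 512, 512, 3).astype(np.float32)  # use a shape valid for the model
onnx_out = sess.run(None, {inp.name: x})
# compare onnx_out against the original TF model's output on the same x,
# e.g. with np.allclose(..., atol=1e-4)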

@romil611

@TomWildenhain-Microsoft I was able to get rid of that warning when I used tf2onnx 1.10.
I was using https://storage.googleapis.com/cloud-tpu-checkpoints/efficientdet/coco2/efficientdet-d4.tar.gz for testing.

Anyway, thanks for responding!
