Models with no ZIG? #57
Comments
It could be resolved. To facilitate troubleshooting, could you share the rendered pruning dependency graph or the ONNX model you edited? We actually met one similar case when employing OTO on StyleGANv2. The root cause there was that all basic modules were customized, including StyleGAN's customized linear and conv operators, which are not covered by OTO's operator list. The issue was resolved after rewriting all customized conv and linear layers back to standard ones. But your root cause may be different.
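For illustration, here is a minimal sketch of that kind of rewrite, assuming the custom layer stores `weight`/`bias` the way `nn.Linear` does (the `EqualLinear` name comes from StyleGAN2; any runtime weight scaling such layers apply is ignored here, so treat this as a sketch rather than a drop-in fix):

```python
import torch
import torch.nn as nn

def to_standard_linear(custom_linear):
    # Rebuild a customized linear layer (e.g. StyleGAN2's EqualLinear) as a
    # standard nn.Linear so that OTO's operator list recognizes it.
    out_features, in_features = custom_linear.weight.shape
    std = nn.Linear(in_features, out_features, bias=custom_linear.bias is not None)
    with torch.no_grad():
        std.weight.copy_(custom_linear.weight)
        if custom_linear.bias is not None:
            std.bias.copy_(custom_linear.bias)
    return std
```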
@tianyic Thanks for the reply. By pruning dependency graph, do you mean I should call oto.visualize()? ONNX doesn't support STFT and l1_loss, so I used the STFT outputs as the model inputs and changed l1_loss to MSE loss. Must all the operators in my model be in the OTO operator list?
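For context, a minimal sketch of that kind of rewrite (the shapes and tensors here are hypothetical, not from this model):

```python
import torch
import torch.nn.functional as F

# Compute the STFT outside the network, since ONNX export does not support
# torch.stft; the network then consumes the spectrogram as its input.
waveform = torch.randn(2, 16000)  # hypothetical batch of audio
spec = torch.stft(waveform, n_fft=512,
                  window=torch.hann_window(512),
                  return_complex=True).abs()

# Swap l1_loss for mse_loss, which exports cleanly.
pred = torch.randn_like(spec)   # stand-in for the model output
loss = F.mse_loss(pred, spec)   # instead of F.l1_loss(pred, spec)
```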
You can use oto.visualize(view=False, out_dir=PATH, display_params=True) to render the pruning dependency graph. The nn.Modules in the target DNN do not all need to be included in the OTO operator list, yet the DNN needs to be composed of the basic or composed modules shown in the OTO operator list. Of course, people can add new basic modules too.
I have almost 9000 nodes... Is there a way to analyse the dependency file directly?
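One way to inspect a huge graph offline, borrowing the internal `_graph.get_param_groups()` call that appears later in this thread (`_graph` is an internal attribute, so treat this as an assumption about the library rather than stable API):

```python
import pickle
from only_train_once import OTO

oto = OTO(model, dummy_input=dummy_input)  # model / dummy_input as in your setup

# Render to disk without opening a viewer; 9000 nodes is too many to eyeball.
oto.visualize(view=False, out_dir='./graphs', display_params=True)

# Dump the raw parameter groups for offline analysis.
param_groups = list(oto._graph.get_param_groups())
with open('param_groups.pkl', 'wb') as f:
    pickle.dump(param_groups, f)
```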
It could be resolved. We meet similar issues when tackling new DNNs with complicated structures. Please consider the following suggestions, which we frequently use to support new DNNs, e.g., applying OTO to each sub-module separately:
# Apply OTO to the encoder and decoder separately, each with its own dummy input.
oto_encoder = OTO(model.encoder, dummy_input_encoder)
oto_encoder.visualize()
optimizer_encoder = oto_encoder.some_optimizer()
oto_decoder = OTO(model.decoder, dummy_input_decoder)
oto_decoder.visualize()
optimizer_decoder = oto_decoder.some_optimizer()
@tianyic Thanks for the suggestions. I re-ran it with batch size 2 for the dummy input and got this graph. It looks like the entire graph is one group, and every node has dashed lines, meaning unprunable. What should I do?
Good to know that you got the graph, which helps troubleshooting a lot. I feel this DNN (a transformer) should be smoothly supported after properly adding (at most) two operators in operator.py. Please consider the following suggestions.
BASIC_MODULES = {
    'ConvTranspose2d': ConvTranspose2dOTO,
    'Conv2d': Conv2dOTO,
    'ModulatedConv2d': Conv2dOTO,  # For StyleGANv2
    'EqualLinear': LinearOTO,      # For StyleGANv2
    'Linear': LinearOTO,
    'BatchNorm2d': BatchNormOTO,
    'InstanceNorm2d': InstanceNormOTO,
    'GroupNorm': GroupNormOTO,
    'Embedding': EmbeddingOTO,
    'LlamaRMSNorm': LayerNormOTO,
    'LayerNorm': LayerNormOTO,
    'PReLU': PReLUOTO,
    # Add the module for conv here; the key is the module's class name
    # (case-sensitive), the value is the corresponding OTO operator.
    'moduleForConv': LinearOTO
}
COMPOSED_MODULES = {
    'LlamaAttention': LlamaAttentionOTO,
    'SelfAttention': BaseMultiHeadAttentionOTO,
    'BertAttention': BertAttentionOTO,
    'PhiMHA': PhiAttentionOTO,
    'LoraLinear': LoraLinearOTO,
    'LoraEmbedding': LoraEmbeddingOTO,
    # Multi-head attention for TTS
    'dummyTTSAttention': BaseMultiHeadAttentionOTO
}
Please share the re-rendered pruning dependency graph after the above steps, or let me know if you have questions. Lastly, OTO requires the DNN to be composed of the modules listed in operator.py. If some basic nn.Modules in the DNN are not covered yet, we need to add them to the operator list.
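As a quick sanity check before re-rendering, something like the following lists which leaf module types in a model carry parameters but are absent from the operator list (the import path for operator.py is an assumption; adjust it to your checkout):

```python
# Hypothetical import path; point it at wherever operator.py lives.
from only_train_once.operation.operator import BASIC_MODULES, COMPOSED_MODULES

covered = set(BASIC_MODULES) | set(COMPOSED_MODULES)
uncovered = {
    type(m).__name__
    for m in model.modules()
    if not list(m.children())               # leaf modules only
    and list(m.parameters(recurse=False))   # that own parameters
    and type(m).__name__ not in covered
}
print('Module types not in the OTO operator list:', sorted(uncovered))
```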
I would expect that after completing the above suggestions, the pruning dependency graph would have multiple node groups filled with solid colors, i.e., prunable.
@tianyic Thank you for the suggestions! I can see different colours on the graph now. However, I am getting repeated parameters:
There are 3 categories of duplicate params: (1) in conv layers, (2) in norm layers and (3) attention linear_out. (1) Repeated conv params:
Does this have to do with setting Conv1d => LinearOTO? Maybe I should adapt Conv2d to Conv1d? (2) Repeated LayerNorm params:
(3) Repeated MultiHeadAttention params (only in
I have skipped the duplicate params by changing to
This does make OTO run, but I don't know if it's correct or not.
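For what it's worth, a small check like this (plain PyTorch, no OTO API) can tell whether the repeats come from one tensor being reused (true weight sharing) or from distinct tensors that merely share a name pattern:

```python
from collections import defaultdict

# Group parameter names by underlying storage pointer: names that share a
# pointer refer to the same tensor object, i.e., genuine weight sharing.
# remove_duplicate=False keeps shared tensors visible under every name.
by_storage = defaultdict(list)
for name, param in model.named_parameters(remove_duplicate=False):
    by_storage[param.data_ptr()].append(name)

for names in by_storage.values():
    if len(names) > 1:
        print('shared tensor:', names)
```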
Glad to hear that some node groups are now prunable. Please consider the following suggestions and comments.
I simply load the text-to-speech model and pass the dummy input to OTO:
from only_train_once import OTO
import pickle
dummy_input_path = args.optim_conf.pop("dummy_input_path")
with open(dummy_input_path, "rb") as f:
    dummy_input = pickle.load(f)
for name, tensor in dummy_input.items():
    dummy_input[name] = tensor[0:2].to("cuda")  # let batch size = 2 to make it work
oto = OTO(model, dummy_input=dummy_input)
optimizers = [oto.hesso(**args.optim_conf)]
I have added the following at https://github.com/tianyic/only_train_once/blob/466aa9d31c19786d8a7aa10701a7b87f655931c0/only_train_once/__init__.py#L46-L49:
self.visualize(view=False, out_dir="exp/fs2_oto/graph_3.gv", display_params=True)
params = self._graph.get_param_groups()
import pickle
with open('params.pkl', 'wb') as f:
    pickle.dump(list(params), f)
exit()
Graph: ESPnetTTSModel_pruning_dependency.pdf
The duplicate params are listed in the attached params_no_tensor.txt. I am leaving on a trip soon but will add Conv1d and try again when I get back. Thanks for being so helpful!
Take your time and enjoy your trip. Regarding weight sharing: due to a string decoding issue, the params_no_tensor.txt has some noisy strings on my end, preventing me from reading it clearly. Anyway, if the weight sharing is caused by one nn.Module being called multiple times during the forward pass, we typically use OTO on the specific sub-modules rather than the full model. Meanwhile, it is better to ensure the optimizers' parameter lists cover the full model's parameter list. Here is one rough example that may help:
class FullModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = XX
        self.decoder = XX
    def forward(self, x):
        # The encoder is invoked twice in one forward pass (weight sharing).
        return self.decoder(self.encoder(self.encoder(x)))
# Build one OTO instance and one optimizer per sub-module.
oto_encoder = OTO(model.encoder, dummy_input_encoder)
optimizer_encoder = oto_encoder.hesso()
oto_decoder = OTO(model.decoder, dummy_input_decoder)
optimizer_decoder = oto_decoder.hesso()
# optimizer_encoder and optimizer_decoder together cover all parameters in FullModel.
# Train as normal, but step both optimizers.
optimizer_encoder.step()
optimizer_decoder.step()
# After training, construct the pruned sub-networks.
oto_encoder.construct_subnet()
oto_decoder.construct_subnet()
I don't understand what you mean by
Do you mean I should do this?
def set_num_groups(self):
    self.num_groups = 1
But this looks correct because:
for param_name in self.name_to_param:
    param = self.name_to_param[param_name]
    self.num_groups = max(self.num_groups, param.shape[0])
Actually, I don't think there is any difference between Conv1dOTO and Conv2dOTO except for
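For reference, the leading dimension of both Conv1d and Conv2d weights is out_channels, so the param.shape[0] logic above yields the same num_groups either way (standard PyTorch shapes, nothing OTO-specific):

```python
import torch.nn as nn

conv1d = nn.Conv1d(in_channels=8, out_channels=16, kernel_size=3)
conv2d = nn.Conv2d(in_channels=8, out_channels=16, kernel_size=3)

print(conv1d.weight.shape)  # torch.Size([16, 8, 3])
print(conv2d.weight.shape)  # torch.Size([16, 8, 3, 3])
# weight.shape[0] == out_channels == 16 in both cases.
```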
We are on the same page, yet may refer to different
I've implemented Conv1D / Conv3D as an extension of Conv2D, but it looks like the problem is with LayerNorm and MultiHeadAttention. The repeated params are:
I am not sure about the LayerNorm and am debugging through the graph init to see what's happening. However, for MultiHeadAttention, I think I need to subclass it because there is a final linear layer, but I don't know what to do there.
I didn't explain properly... that was a pickle file renamed to .txt, as GitHub does not allow .pkl attachments. Here it is in zip form (you have to load it with
Minor issue: I have nodes where
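As a side note on the Conv1D-as-Conv2D route mentioned above: one way to realize it without touching operator.py is to rewrite each nn.Conv1d as an equivalent nn.Conv2d before constructing OTO, so the existing Conv2d operator covers it. A sketch with a hypothetical wrapper (plain PyTorch; assumes integer padding rather than 'same'):

```python
import torch
import torch.nn as nn

class Conv1dAsConv2d(nn.Module):
    # Wraps an nn.Conv1d as an equivalent nn.Conv2d with a (1, k) kernel.
    def __init__(self, conv1d: nn.Conv1d):
        super().__init__()
        self.conv = nn.Conv2d(
            conv1d.in_channels, conv1d.out_channels,
            kernel_size=(1, conv1d.kernel_size[0]),
            stride=(1, conv1d.stride[0]),
            padding=(0, conv1d.padding[0]),
            dilation=(1, conv1d.dilation[0]),
            groups=conv1d.groups,
            bias=conv1d.bias is not None,
        )
        with torch.no_grad():
            # (out, in, k) -> (out, in, 1, k)
            self.conv.weight.copy_(conv1d.weight.unsqueeze(2))
            if conv1d.bias is not None:
                self.conv.bias.copy_(conv1d.bias)

    def forward(self, x):  # x: (N, C, L)
        return self.conv(x.unsqueeze(2)).squeeze(2)
```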
Happy to hear that you implemented Conv1d and Conv3d, that is great. I will look into your case after wrapping up some business matters, and will loop back soon.
I see. Regarding this weight sharing case, I suggest directly overriding the modules that hold the repeated params. For example, with repeated_weight and repeated_bias here:
model.encoder.encoders[0].self_attn.linear_out = nn.Linear(**kwargs)
model.encoder.encoders[0].self_attn.linear_out.weight.data.copy_(repeated_weight.data)
model.encoder.encoders[0].self_attn.linear_out.bias.data.copy_(repeated_bias.data)
model.encoder.encoders[1].self_attn.linear_out = nn.Linear(**kwargs)
model.encoder.encoders[1].self_attn.linear_out.weight.data.copy_(repeated_weight.data)
model.encoder.encoders[1].self_attn.linear_out.bias.data.copy_(repeated_bias.data)
Afterwards, the issue should be resolved, though the parameter count might slightly increase before pruning. Another way to resolve it is to make the
Regarding FLOPs, yes, please skip that; compute_flops is optional.
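If many blocks share the same tensors, the per-layer override above can be looped rather than written out by hand (same assumed kwargs, repeated_weight, and repeated_bias as in that snippet):

```python
import torch.nn as nn

# Give every encoder block its own fresh linear_out, initialized from the
# previously shared weights, so no parameter tensor is reused across blocks.
for block in model.encoder.encoders:
    fresh = nn.Linear(**kwargs)
    fresh.weight.data.copy_(repeated_weight.data)
    fresh.bias.data.copy_(repeated_bias.data)
    block.self_attn.linear_out = fresh
```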
Sorry, I can't understand what you mean... should I just copy out all the module data before the graph init and put it back? Also, apologies for the confusion -- there are no shared weights, only shared modules. The encoder and decoder are the same architecture but instantiated separately:
encoder = TransformerEncoder(
    idim=idim,
    attention_dim=adim,
    attention_heads=aheads,
    linear_units=eunits,
    num_blocks=elayers,
    input_layer=encoder_input_layer,
    dropout_rate=transformer_enc_dropout_rate,
    positional_dropout_rate=transformer_enc_positional_dropout_rate,
    attention_dropout_rate=transformer_enc_attn_dropout_rate,
    pos_enc_class=pos_enc_class,
    normalize_before=encoder_normalize_before,
    concat_after=encoder_concat_after,
    positionwise_layer_type=positionwise_layer_type,
    positionwise_conv_kernel_size=positionwise_conv_kernel_size,
)
decoder = TransformerEncoder(
    idim=0,
    attention_dim=adim,
    attention_heads=aheads,
    linear_units=dunits,
    num_blocks=dlayers,
    input_layer=None,
    dropout_rate=transformer_dec_dropout_rate,
    positional_dropout_rate=transformer_dec_positional_dropout_rate,
    attention_dropout_rate=transformer_dec_attn_dropout_rate,
    pos_enc_class=pos_enc_class,
    normalize_before=decoder_normalize_before,
    concat_after=decoder_concat_after,
    positionwise_layer_type=positionwise_layer_type,
    positionwise_conv_kernel_size=positionwise_conv_kernel_size,
)
I see that you have
PS: I emailed you, maybe we can discuss on Teams?
Hi @tianyic ,
I am trying to use OTO on speech models (FastSpeech2) and rewrote parts to make sure all the PyTorch ops are supported in ONNX.
However, I found that nothing was pruned. When I run
I get
hesso.total_num_groups = 0
Target redundant groups per period: [0]
Does this mean there are no zero-invariant groups (ZIGs) in the model? This is strange, because there are conv layers in the transformer encoder/decoder. Reference code
Any help appreciated, thanks!