Commit
Merge branch 'main' of https://github.com/NVIDIA/NeMo into casual_conformer_lookahead_newdesign
Showing 25 changed files with 638 additions and 520 deletions.
32 changes: 32 additions & 0 deletions
examples/nlp/language_modeling/conf/megatron_model_base_config.yaml
@@ -0,0 +1,32 @@
num_layers: 12 # For perceiver models, this is the number of cross-attention blocks. Each layer has 1 cross-attention and "num_self_attention_per_cross_attention" self-attention layers.
hidden_size: 768
ffn_hidden_size: 3072 # Transformer FFN hidden size. Usually 4 * hidden_size.
num_attention_heads: 12
init_method_std: 0.02 # Standard deviation of the zero-mean normal distribution used for weight initialization.
hidden_dropout: 0.1 # Dropout probability for the transformer hidden state.
attention_dropout: 0.1 # Dropout probability in the attention layer.
position_embedding_type: 'learned_absolute' # Position embedding type. Options: ['learned_absolute', 'relative']
relative_attention_num_buckets: 32 # Number of buckets for computing the relative position bias.
relative_attention_max_distance: 128 # Maximum distance used when bucketing relative positions into relative_attention_num_buckets.
relative_position_bias_self_attention_only: True # Whether to use relative position bias for self-attention only.
kv_channels: null # Projection weights dimension in multi-head attention. Set to hidden_size // num_attention_heads if null.
apply_query_key_layer_scaling: True # Scale Q * K^T by 1 / layer-number.
layernorm_epsilon: 1e-5
persist_layer_norm: True # Use the persistent fused layer norm kernel.
bias_activation_fusion: True # Use a kernel that fuses the bias addition from weight matrices with the subsequent activation function.
grad_div_ar_fusion: True # Fuse grad division into torch.distributed.all_reduce.
masked_softmax_fusion: True # Use a kernel that fuses the attention softmax with its mask.
bias_dropout_add_fusion: True # Use a kernel that fuses the bias addition, dropout and residual connection addition.
bias: True # Whether to use bias terms in all weight matrices.
normalization: 'layernorm' # Normalization layer to use. Options: ['layernorm', 'rmsnorm']
arch: 'transformer' # Options: ['transformer', 'perceiver']
activation: 'gelu' # Options: ['gelu', 'geglu', 'swiglu', 'reglu']
headscale: False # Whether to learn extra parameters that scale the output of each self-attention head.
transformer_block_type: 'pre_ln' # Options: ['pre_ln', 'post_ln', 'normformer']
hidden_steps: 32 # Number of latent vectors to use for perceiver encoders.
num_self_attention_per_cross_attention: 1 # Number of self-attention layers for every cross-attention layer.
openai_gelu: False # Use OpenAI's GELU instead of the default GELU.
onnx_safe: False # Use workarounds for known problems with the Torch ONNX exporter.
fp32_residual_connection: False # Use FP32 for residual connections.
activations_checkpoint_method: null # Options: 'uniform', 'block'
activations_checkpoint_num_layers: 1
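As a usage sketch (not part of this commit), the base config above can be loaded and selectively overridden with OmegaConf, which NeMo's example scripts build on. The file path and keys below come from this diff; the loading script itself and the override values are illustrative assumptions, not code from the repository.

```python
# Sketch: load the base transformer config and override a few fields with OmegaConf.
# The YAML path is taken from this diff; the override values are examples only.
from omegaconf import OmegaConf

base = OmegaConf.load(
    "examples/nlp/language_modeling/conf/megatron_model_base_config.yaml"
)

# Hypothetical overrides, e.g. switching to a perceiver encoder with more latents
# and relative position embeddings.
overrides = OmegaConf.create(
    {
        "arch": "perceiver",
        "hidden_steps": 64,
        "position_embedding_type": "relative",
    }
)

cfg = OmegaConf.merge(base, overrides)
print(OmegaConf.to_yaml(cfg))
```

In NeMo's Hydra-based examples, a base config like this is typically merged into a larger model config (e.g. as encoder/decoder sub-configs) rather than loaded directly, so the snippet above only illustrates the override mechanics.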