
increase N for experiments on Wu et al 2023 baseline, 3 or 4 layer variants reported in results (report additional runs that had not completed by original draft) #3

Conversation

willy-b (Owner) commented Dec 16, 2024

Small change: report experiments that had not completed by the original draft deadline, increasing to n=10 the number of Transformers trained from scratch for the Wu et al 2023 baseline, 3- and 4-layer variants.

draft PDF link: https://raw.githubusercontent.com/willy-b/RASP-for-ReCOGS/fdde271505961a7adf5d6b51b63d2d52108c98f2/rasp-for-recogs_pos-wbruns-2024-draft.pdf

Layers here are Transformer blocks (we wanted to check whether adding more blocks would get the model to learn a more tree-like representation of the grammar and avoid the mistakes predicted by the non-tree/non-recursive solution, reported by Wu et al 2023 and confirmed by me).

Specifically, when I say 4 layers, I mean 4 x BertLayer in the Encoder and Decoder.
Here is Wu et al 2023's baseline Transformer in that configuration:


```
EncoderDecoderModel(
  (encoder): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(762, 300, padding_idx=0)
      (position_embeddings): Embedding(512, 300)
      (token_type_embeddings): Embedding(2, 300)
      (LayerNorm): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-3): 4 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=300, out_features=300, bias=True)
              (key): Linear(in_features=300, out_features=300, bias=True)
              (value): Linear(in_features=300, out_features=300, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=300, out_features=300, bias=True)
              (LayerNorm): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=300, out_features=512, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): BertOutput(
            (dense): Linear(in_features=512, out_features=300, bias=True)
            (LayerNorm): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=300, out_features=300, bias=True)
      (activation): Tanh()
    )
  )
  (decoder): BertLMHeadModel(
    (bert): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(729, 300, padding_idx=0)
        (position_embeddings): Embedding(512, 300)
        (token_type_embeddings): Embedding(2, 300)
        (LayerNorm): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0-3): 4 x BertLayer(
            (attention): BertAttention(
              (self): BertSdpaSelfAttention(
                (query): Linear(in_features=300, out_features=300, bias=True)
                (key): Linear(in_features=300, out_features=300, bias=True)
                (value): Linear(in_features=300, out_features=300, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): BertSelfOutput(
                (dense): Linear(in_features=300, out_features=300, bias=True)
                (LayerNorm): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (crossattention): BertAttention(
              (self): BertSdpaSelfAttention(
                (query): Linear(in_features=300, out_features=300, bias=True)
                (key): Linear(in_features=300, out_features=300, bias=True)
                (value): Linear(in_features=300, out_features=300, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): BertSelfOutput(
                (dense): Linear(in_features=300, out_features=300, bias=True)
                (LayerNorm): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (intermediate): BertIntermediate(
              (dense): Linear(in_features=300, out_features=512, bias=True)
              (intermediate_act_fn): GELUActivation()
            )
            (output): BertOutput(
              (dense): Linear(in_features=512, out_features=300, bias=True)
              (LayerNorm): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
      )
    )
    (cls): BertOnlyMLMHead(
      (predictions): BertLMPredictionHead(
        (transform): BertPredictionHeadTransform(
          (dense): Linear(in_features=300, out_features=300, bias=True)
          (transform_act_fn): GELUActivation()
          (LayerNorm): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
        )
        (decoder): Linear(in_features=300, out_features=729, bias=True)
      )
    )
  )
)
```
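
For convenience, here is a minimal sketch (my own, not Wu et al 2023's run_cogs.py, which remains the authoritative source for their hyperparameters) of how an EncoderDecoderModel with the shape printed above can be built and inspected with Hugging Face Transformers. The sizes are read off the printout (hidden_size=300, intermediate_size=512, encoder vocab 762, decoder vocab 729); num_attention_heads is left at the library default, which may differ from their setting.

```
# Minimal sketch (not Wu et al 2023's script): build a randomly initialized
# encoder-decoder with the shape shown in the snapshot above and print it.
# Sizes are taken from the printout; num_attention_heads is left at the
# BertConfig default, which may differ from Wu et al 2023's setting.
from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

def build_baseline(num_layers: int = 4) -> EncoderDecoderModel:
    encoder_config = BertConfig(
        vocab_size=762,                # source vocabulary size, per the printout
        hidden_size=300,
        intermediate_size=512,
        num_hidden_layers=num_layers,  # the depth we vary (2, 3, or 4 blocks)
    )
    decoder_config = BertConfig(
        vocab_size=729,                # target (logical form) vocabulary size, per the printout
        hidden_size=300,
        intermediate_size=512,
        num_hidden_layers=num_layers,
    )
    # from_encoder_decoder_configs sets is_decoder=True and add_cross_attention=True
    # on the decoder config, which is why each decoder block above has a
    # crossattention submodule in addition to self-attention.
    config = EncoderDecoderConfig.from_encoder_decoder_configs(encoder_config, decoder_config)
    return EncoderDecoderModel(config=config)

if __name__ == "__main__":
    print(build_baseline(num_layers=4))  # prints an architecture like the snapshot above
```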


We already noted:
```
For training Transformers from scratch with randomly initialized weights using gradient descent for comparison with RASP predictions, we use scripts derived from those provided by \cite{Wu2023}\footnote{
\begin{tiny}
https://github.com/frankaging/ReCOGS/blob/
1b6eca8ff4dca5fd2fb284a7d470998af5083beb/run\_cogs.py

and

https://github.com/frankaging/ReCOGS/blob/
1b6eca8ff4dca5fd2fb284a7d470998af5083beb/model/encoder\_decoder\_hf.py
\end{tiny}
}.
```

But now let us add a snapshot in the appendix of the actual model architecture generated by their script, for easy reference and to make sure there is no confusion about what we are varying in our depth-variation experiments (3- and 4-block/layer versions compared to Wu et al 2023's 2 block/layer baseline).
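
As a quick sanity check that the depth variation changes only the number of blocks, one can compare parameter counts across the 2-, 3-, and 4-block variants; a sketch using the hypothetical build_baseline helper from the snippet above:

```
# Sketch: only num_hidden_layers differs between the depth variants.
# Uses the hypothetical build_baseline() helper defined in the earlier snippet.
for num_layers in (2, 3, 4):
    model = build_baseline(num_layers=num_layers)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{num_layers} blocks per stack: {n_params:,} parameters")
```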
willy-b marked this pull request as ready for review December 17, 2024 01:56
willy-b merged commit 79c4544 into main Dec 17, 2024
willy-b deleted the report-experiments-that-completed-after-initial-draft-deadline-for-wu-et-al-2023-baseline-transformer-with-3-or-4-layers branch December 17, 2024 14:22
willy-b (Owner, Author) commented Dec 17, 2024

Note: when comparing with the Wu et al 2023 baseline Transformer provided in https://github.com/frankaging/ReCOGS, for the time being one should run !pip install transformers==v4.45.2 in any notebook before running their repo, as a breaking change affecting their repository was introduced somewhere between that version and v4.46.3 of Hugging Face's Transformers. (This does NOT affect the RASP model.)

See an example reproduction of the issue when running the default run_cogs.py arguments (per https://github.com/frankaging/ReCOGS/blob/1b6eca8ff4dca5fd2fb284a7d470998af5083beb/README.md?plain=1#L75) at https://colab.research.google.com/drive/1pv4tqu4XunBMwyfPF43T8omkUhN3pkYB.
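
For example, a first notebook cell along these lines (a sketch; I have not bisected exactly which release between 4.45.2 and 4.46.3 introduced the break) keeps their baseline runnable:

```
# Sketch of a first Colab cell before running Wu et al 2023's run_cogs.py:
# pin Hugging Face Transformers to the last version known to work with their repo.
# (The RASP model does not depend on this pin.)
!pip install -q transformers==v4.45.2

# If transformers was already imported in this runtime, restart the runtime
# after installing so the pinned version is actually the one loaded.
import transformers
assert transformers.__version__ == "4.45.2", transformers.__version__
```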

willy-b (Owner, Author) commented Dec 17, 2024

Reported the Wu et al 2023 baseline Transformer's incompatibility with HF Transformers v4.46.3 (requiring a downgrade via pip install transformers==v4.45.2) back to that project at frankaging/ReCOGS#1.
