
increase N for experiments on Wu et al 2023 baseline, 3 or 4 layer variants reported in results (report additional runs that had not completed by original draft) #3

Conversation

willy-b (Owner) commented Dec 16, 2024

Small change: report experiments that had not completed by the original draft deadline, increasing to n=10 the number of Transformers trained from scratch for the Wu et al 2023 baseline, 3- and 4-layer variants.

draft PDF link: https://raw.githubusercontent.com/willy-b/RASP-for-ReCOGS/fdde271505961a7adf5d6b51b63d2d52108c98f2/rasp-for-recogs_pos-wbruns-2024-draft.pdf

Layers here are Transformer blocks (we wanted to check whether adding more blocks would get the model to learn a more tree-like representation of the grammar and avoid the mistakes predicted by the non-tree/non-recursive solution, reported by Wu et al 2023 and confirmed by me).

Specifically, when I say 4 layers, I mean 4 x BertLayer in the Encoder and Decoder.
Here is Wu et al 2023's baseline Transformer in that configuration:


```
EncoderDecoderModel(
  (encoder): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(762, 300, padding_idx=0)
      (position_embeddings): Embedding(512, 300)
      (token_type_embeddings): Embedding(2, 300)
      (LayerNorm): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-3): 4 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=300, out_features=300, bias=True)
              (key): Linear(in_features=300, out_features=300, bias=True)
              (value): Linear(in_features=300, out_features=300, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=300, out_features=300, bias=True)
              (LayerNorm): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=300, out_features=512, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): BertOutput(
            (dense): Linear(in_features=512, out_features=300, bias=True)
            (LayerNorm): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=300, out_features=300, bias=True)
      (activation): Tanh()
    )
  )
  (decoder): BertLMHeadModel(
    (bert): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(729, 300, padding_idx=0)
        (position_embeddings): Embedding(512, 300)
        (token_type_embeddings): Embedding(2, 300)
        (LayerNorm): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0-3): 4 x BertLayer(
            (attention): BertAttention(
              (self): BertSdpaSelfAttention(
                (query): Linear(in_features=300, out_features=300, bias=True)
                (key): Linear(in_features=300, out_features=300, bias=True)
                (value): Linear(in_features=300, out_features=300, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): BertSelfOutput(
                (dense): Linear(in_features=300, out_features=300, bias=True)
                (LayerNorm): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (crossattention): BertAttention(
              (self): BertSdpaSelfAttention(
                (query): Linear(in_features=300, out_features=300, bias=True)
                (key): Linear(in_features=300, out_features=300, bias=True)
                (value): Linear(in_features=300, out_features=300, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): BertSelfOutput(
                (dense): Linear(in_features=300, out_features=300, bias=True)
                (LayerNorm): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (intermediate): BertIntermediate(
              (dense): Linear(in_features=300, out_features=512, bias=True)
              (intermediate_act_fn): GELUActivation()
            )
            (output): BertOutput(
              (dense): Linear(in_features=512, out_features=300, bias=True)
              (LayerNorm): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
      )
    )
    (cls): BertOnlyMLMHead(
      (predictions): BertLMPredictionHead(
        (transform): BertPredictionHeadTransform(
          (dense): Linear(in_features=300, out_features=300, bias=True)
          (transform_act_fn): GELUActivation()
          (LayerNorm): LayerNorm((300,), eps=1e-12, elementwise_affine=True)
        )
        (decoder): Linear(in_features=300, out_features=729, bias=True)
      )
    )
  )
)
```
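
For convenience, here is a minimal sketch (my own, not Wu et al 2023's run_cogs.py, which remains the authoritative source for their hyperparameters) of how an EncoderDecoderModel with the shape printed above can be built and inspected with Hugging Face Transformers. The sizes are read off the printout (hidden_size=300, intermediate_size=512, encoder vocab 762, decoder vocab 729); num_attention_heads is left at the library default, which may differ from their setting.

```
# Minimal sketch (not Wu et al 2023's script): build a randomly initialized
# encoder-decoder with the shape shown in the snapshot above and print it.
# Sizes are taken from the printout; num_attention_heads is left at the
# BertConfig default, which may differ from Wu et al 2023's setting.
from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

def build_baseline(num_layers: int = 4) -> EncoderDecoderModel:
    encoder_config = BertConfig(
        vocab_size=762,                # source vocabulary size, per the printout
        hidden_size=300,
        intermediate_size=512,
        num_hidden_layers=num_layers,  # the depth we vary (2, 3, or 4 blocks)
    )
    decoder_config = BertConfig(
        vocab_size=729,                # target (logical form) vocabulary size, per the printout
        hidden_size=300,
        intermediate_size=512,
        num_hidden_layers=num_layers,
    )
    # from_encoder_decoder_configs sets is_decoder=True and add_cross_attention=True
    # on the decoder config, which is why each decoder block above has a
    # crossattention submodule in addition to self-attention.
    config = EncoderDecoderConfig.from_encoder_decoder_configs(encoder_config, decoder_config)
    return EncoderDecoderModel(config=config)

if __name__ == "__main__":
    print(build_baseline(num_layers=4))  # prints an architecture like the snapshot above
```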


We already noted:
```
For training Transformers from scratch with randomly initialized weights using gradient descent for comparison with RASP predictions, we use scripts derived from those provided by \cite{Wu2023}\footnote{
\begin{tiny}
https://github.com/frankaging/ReCOGS/blob/
1b6eca8ff4dca5fd2fb284a7d470998af5083beb/run\_cogs.py

and

https://github.com/frankaging/ReCOGS/blob/
1b6eca8ff4dca5fd2fb284a7d470998af5083beb/model/encoder\_decoder\_hf.py
\end{tiny}
}.
```

But now let us add a snapshot in the appendix of the actual model architecture generated by their script, for easy reference and to make sure there is no confusion about what we are varying in our depth-variation experiments (3- and 4-block/layer versions compared to Wu et al 2023's 2 block/layer baseline).
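
As a quick sanity check that the depth variation changes only the number of blocks, one can compare parameter counts across the 2-, 3-, and 4-block variants; a sketch using the hypothetical build_baseline helper from the snippet above:

```
# Sketch: only num_hidden_layers differs between the depth variants.
# Uses the hypothetical build_baseline() helper defined in the earlier snippet.
for num_layers in (2, 3, 4):
    model = build_baseline(num_layers=num_layers)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{num_layers} blocks per stack: {n_params:,} parameters")
```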
willy-b marked this pull request as ready for review December 17, 2024 01:56
willy-b merged commit 79c4544 into main Dec 17, 2024
willy-b deleted the report-experiments-that-completed-after-initial-draft-deadline-for-wu-et-al-2023-baseline-transformer-with-3-or-4-layers branch December 17, 2024 14:22
willy-b (Owner, Author) commented Dec 17, 2024

Note: when comparing with the Wu et al 2023 baseline Transformer provided in https://github.com/frankaging/ReCOGS, for the time being one should run !pip install transformers==v4.45.2 in any notebook before running their repo, as a breaking change affecting their repository was introduced somewhere between that version and v4.46.3 of Hugging Face's Transformers. (This does NOT affect the RASP model.)

See an example reproduction of the issue when running the default run_cogs.py arguments (per https://github.com/frankaging/ReCOGS/blob/1b6eca8ff4dca5fd2fb284a7d470998af5083beb/README.md?plain=1#L75) at https://colab.research.google.com/drive/1pv4tqu4XunBMwyfPF43T8omkUhN3pkYB.
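
For example, a first notebook cell along these lines (a sketch; I have not bisected exactly which release between 4.45.2 and 4.46.3 introduced the break) keeps their baseline runnable:

```
# Sketch of a first Colab cell before running Wu et al 2023's run_cogs.py:
# pin Hugging Face Transformers to the last version known to work with their repo.
# (The RASP model does not depend on this pin.)
!pip install -q transformers==v4.45.2

# If transformers was already imported in this runtime, restart the runtime
# after installing so the pinned version is actually the one loaded.
import transformers
assert transformers.__version__ == "4.45.2", transformers.__version__
```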

willy-b (Owner, Author) commented Dec 17, 2024

Reported the Wu et al 2023 baseline Transformer's incompatibility with HF Transformers v4.46.3 (requiring a downgrade via pip install transformers==v4.45.2) back to that project at frankaging/ReCOGS#1.
