Model based on HuggingFace T5 #2

Open
bjascob opened this issue Dec 18, 2020 · 0 comments

bjascob commented Dec 18, 2020

Just FYI in case someone is interested...

I uploaded to amrlib a parse model based on the pretrained HuggingFace T5-base model. It scores 81 smatch on LDC2020T02. I haven't put much work into optimizing parameters (e.g., the number of epochs or choosing the best epoch), and there's also the T5-large model to try, so if someone wants to push the SoTA with these, the pretrained transformers are a good place to start.

The format of the graph serialization (model input) can impact the results. Originally, I simply removed all the variables and saw an 82 smatch score, but I noticed that graphs with multiple nodes of the same name (e.g., two different "people" nodes) were merged, so I added an _xx suffix to the names to make them unique. This actually reduces the smatch score a bit, I assume because the transformer has trouble with the numbering, but at least it's trying not to merge the nodes. It wasn't quite clear to me from your paper whether you are doing something similar or just allowing multiple nodes in the same graph to be represented by the same string.
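To make the two serialization variants concrete, here is a minimal sketch of what they might look like. It is not the amrlib code: the function name `strip_variables`, the regex-based parsing, and the two-digit reading of the `_xx` suffix are all assumptions made for illustration, and the regexes only handle simple PENMAN strings.

```python
import re
from collections import Counter

# Matches a node definition "(var / concept" in a PENMAN string.
NODE_RE = re.compile(r'\(\s*([^\s()/]+)\s*/\s*([^\s()]+)')
# Matches a role followed by a bare token (a reentrant variable or a constant).
REF_RE = re.compile(r'(:\S+\s+)([^\s()":]+)(?=[\s)])')


def strip_variables(amr: str, unique: bool = False) -> str:
    """Hypothetical helper: drop variables from a PENMAN string.

    With unique=True, concepts that occur more than once get a numeric
    suffix (assumed two-digit format) so distinct nodes stay distinct.
    """
    nodes = NODE_RE.findall(amr)  # [(variable, concept), ...] in order
    totals = Counter(concept for _, concept in nodes)
    seen = Counter()
    names = {}
    for var, concept in nodes:
        if unique and totals[concept] > 1:
            names[var] = f"{concept}_{seen[concept]:02d}"
            seen[concept] += 1
        else:
            names[var] = concept
    # Rewrite node definitions: "(p / person" becomes "(person" or "(person_00".
    out = NODE_RE.sub(lambda m: "(" + names[m.group(1)], amr)
    # Rewrite reentrant references; constants such as "-" are left untouched.
    out = REF_RE.sub(lambda m: m.group(1) + names.get(m.group(2), m.group(2)), out)
    return " ".join(out.split())  # collapse whitespace to a single line


amr = "(w / want-01 :ARG0 (p / person) :ARG1 (m / meet-03 :ARG0 p :ARG1 (p2 / person)))"
print(strip_variables(amr))
print(strip_variables(amr, unique=True))
```

In the first output all three mentions of `person` are indistinguishable, so a de-serializer cannot tell whether there are one or two people in the graph. In the second, the reentrant reference and its node share `person_00` while the other node is `person_01`.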

I also looked at the RikVN scripts that serialize/de-serialize the graphs to see if they had a better way to handle this. One thing to note about these scripts is that they appear to be lossy. If you simply serialize and de-serialize, skipping the model completely, you get about a 0.98 smatch score, meaning the round trip itself introduces a small error. The serializer/de-serializer I'm using gives a 1.0 smatch, so it shouldn't be introducing any error, though I can't be sure it's the "best" possible serialization format.
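The round-trip check described above can be sketched as a small harness. This is only an illustration: `round_trip_accuracy` and the toy serializers are hypothetical names, and the score is an exact-match fraction after whitespace normalization, used as a simple stand-in for smatch (which compares triples rather than strings).

```python
from typing import Callable, Iterable


def normalize(graph: str) -> str:
    """Collapse whitespace so formatting differences are not counted as errors."""
    return " ".join(graph.split())


def round_trip_accuracy(
    graphs: Iterable[str],
    serialize: Callable[[str], str],
    deserialize: Callable[[str], str],
) -> float:
    """Fraction of graphs recovered exactly after serialize then deserialize.

    Exact match is an assumed stand-in for smatch; no model is involved.
    """
    graphs = list(graphs)
    if not graphs:
        raise ValueError("need at least one graph")
    matches = sum(
        normalize(deserialize(serialize(g))) == normalize(g) for g in graphs
    )
    return matches / len(graphs)


def identity(text: str) -> str:
    """Toy lossless step: returns the text unchanged."""
    return text


def lossy_serialize(text: str) -> str:
    """Toy lossy step: lowercasing destroys the casing of name strings."""
    return text.lower()


graphs = ['(p / person :name "Bob")', "(d / dog)"]
print(round_trip_accuracy(graphs, identity, identity))
print(round_trip_accuracy(graphs, lossy_serialize, identity))
```

Any score below 1.0 from a check like this is an upper bound on what the parser can reach with that serialization, independent of how good the model is.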

I'll be interested to hear if someone improves upon these numbers. It seems to me there are at least a few things to try that might push the results up a couple of points.
