Where do test dirs props, null, and ne come from? #9

Open
Adamits opened this issue Jul 9, 2021 · 6 comments

Comments

@Adamits

Adamits commented Jul 9, 2021

Hi!

I noticed in make-wsj-test.sh and make-brown-test.sh that we try to zcat a props, null, and ne file from test.wsj. However, in the extract_test_from_ptb.sh and extract_test_from_brown.sh scripts, none of these dirs/files are generated. Where are these supposed to come from?

Thanks!

@strubell
Owner

strubell commented Jul 10, 2021 via email

@Adamits
Author

Adamits commented Jul 10, 2021

Thanks for the response!

I am probably missing something, but I thought the train directory only had data for WSJ sections 02-21, whereas the test set is section 23. To be specific, I am referencing this line: https://github.com/strubell/preprocess-conll05/blob/master/bin/basic/make-wsj-test.sh#L13 - whereas https://github.com/strubell/preprocess-conll05/blob/master/bin/basic/extract_test_from_ptb.sh only generates words/syntax for section 23.

@XueBingo

> Hi!
>
> I noticed in make-wsj-test.sh and make-brown-test.sh that we try to zcat a props, null, and ne file from test.wsj. However, in the extract_test_from_ptb.sh and extract_test_from_brown.sh scripts, none of these dirs/files are generated. Where are these supposed to come from?
>
> Thanks!

Hello, I have the same problem.
Have you found a solution?
Thanks!

@strubell
Owner

It sounds like you're describing the ptb training data, not the conll data - the directory I'm referring to is the $CONLL05 dir as defined in get_data.sh.

@Adamits
Author

Adamits commented Jul 26, 2021

Yeah, I guess so. I am asking about the test data in particular, which appears to be section 23 of PTB.

So running ./bin/basic/extract_test_from_ptb.sh only extracts words and synts from section 23.

However, bin/basic/make-wsj-test.sh expects props, null, and ne as well. I think that for the train/dev data these dirs come from the conll05 release, in get_data.sh; however, section 23 (the test data) does not seem to be included there.

But for the test data, where do these dirs come from? In bin/basic/make-wsj-test.sh:

    zcat < $CONLL05/$FILE/words/$FILE.words.gz > /tmp/$$.words
    zcat < $CONLL05/$FILE/props/$FILE.props.gz > /tmp/$$.props
    zcat < $CONLL05/$FILE/synt/$FILE.$s.synt.gz > /tmp/$$.synt

    # no senses, set to null
    zcat < $CONLL05/$FILE/null/$FILE.null.gz > /tmp/$$.senses
    zcat < $CONLL05/$FILE/ne/$FILE.ne.gz > /tmp/$$.ne

These commands cannot find the props, null (senses), or ne files, so the script then writes an empty archive.
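
One way to surface this failure, rather than silently ending up with an empty archive, would be to check for each input before decompressing it. The following is only a minimal sketch and is not code from the repository: the function name `check_test_inputs` is hypothetical, and the file layout is assumed from the `zcat` lines quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of preprocess-conll05): report which of the
# inputs read by the zcat lines above are missing for a given split.
# The directory layout is an assumption taken from those zcat lines.
# Usage: check_test_inputs <conll05_dir> <split> <synt_suffix>
check_test_inputs() {
    local conll05="$1" file="$2" s="$3"
    local missing=0
    local path
    for path in \
        "$conll05/$file/words/$file.words.gz" \
        "$conll05/$file/props/$file.props.gz" \
        "$conll05/$file/synt/$file.$s.synt.gz" \
        "$conll05/$file/null/$file.null.gz" \
        "$conll05/$file/ne/$file.ne.gz"
    do
        if [ ! -f "$path" ]; then
            # Name every missing file on stderr so the gap is obvious.
            echo "missing: $path" >&2
            missing=1
        fi
    done
    # Returns 0 only when all five inputs exist.
    return "$missing"
}
```

A script could call it as `check_test_inputs "$CONLL05" test.wsj "$s" || exit 1` before the `zcat` lines, using whatever value the script assigns to `$s`.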

@strubell
Owner

strubell commented Aug 6, 2021

Oh, that's so strange! I guess the senses/ne lines (and corresponding entries in the paste) should be removed, but I'm surprised this non-working version is in the repo. Unfortunately I no longer have access to the old server where I originally developed/ran these scripts, so I can't go back and see if there were uncommitted changes, etc.
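
The change suggested above might look roughly like the sketch below, which reads only words, props, and synt and leaves out the senses and ne columns. This is an illustration under stated assumptions, not the repository's code: the actual `paste` invocation in make-wsj-test.sh is not shown in this thread, so the function name `paste_test_columns`, the column order, and the space delimiter are all hypothetical. It also still requires a props file, whose source for the test split this thread does not resolve.

```shell
#!/usr/bin/env bash
# Hypothetical sketch (not the repository's code): join only the words,
# synt, and props inputs, omitting the senses and ne columns.
# Column order and the space delimiter are assumptions.
# Usage: paste_test_columns <conll05_dir> <split> <synt_suffix>
paste_test_columns() {
    local conll05="$1" file="$2" s="$3"
    local tmpdir status
    tmpdir=$(mktemp -d) || return 1
    if zcat < "$conll05/$file/words/$file.words.gz" > "$tmpdir/words" &&
       zcat < "$conll05/$file/props/$file.props.gz" > "$tmpdir/props" &&
       zcat < "$conll05/$file/synt/$file.$s.synt.gz" > "$tmpdir/synt"
    then
        # One output line per token: word, syntax column, props column.
        paste -d ' ' "$tmpdir/words" "$tmpdir/synt" "$tmpdir/props"
        status=$?
    else
        # Any missing or unreadable input is reported as a failure.
        status=1
    fi
    rm -rf "$tmpdir"
    return "$status"
}
```

Because the function returns a nonzero status when any input is missing, a caller can stop before writing an archive instead of producing an empty one.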
