describe doing value table wrong? #119

Open
HedvigS opened this issue May 20, 2024 · 2 comments
@HedvigS
Collaborator

HedvigS commented May 20, 2024

Doesn't this report from pygrambank describe look wrong for the number of values in ELDP-glottobank/grambank/original_sheets/FCE_apal1257.tsv?

Values:
0	121
1	 67
?	  1
	189

For comparison, here's what I get in R

> library(tidyverse)
> 
> SHEET <- read_tsv("../../../glottobank/ELDP-glottobank/grambank/original_sheets/FCE_apal1257.tsv")
> 
> SHEET$Value %>% table()
.
  ?   0   1 
  1 126  74 

I take it it has to do with the binarised features somehow?

@johenglisch
Collaborator

First of all: that table only shows the number of valid datapoints. If a datapoint has an error in it or describes an unknown feature, it won't show up in the table. That's where the discrepancy comes from.
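To put a number on the discrepancy, the two tables quoted in this thread can be subtracted value by value. This is only arithmetic on the counts posted above; it does not call pygrambank or read the sheet.

```python
from collections import Counter

# Counts copied from the two tables posted in this thread.
describe_counts = Counter({"0": 121, "1": 67, "?": 1})  # pygrambank describe
r_counts = Counter({"?": 1, "0": 126, "1": 74})         # R: SHEET$Value %>% table()

# Rows that are in the sheet but missing from the describe report, per value.
missing = {value: r_counts[value] - describe_counts[value] for value in sorted(r_counts)}

print(missing)
print(sum(r_counts.values()), sum(describe_counts.values()), sum(missing.values()))
```

So 12 of the 201 rows in the sheet (five with value 0, seven with value 1) are left out of the report, which matches the total of 189 it prints.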

Second: Whether or not a feature is ‘known’ seems to depend on gb20.txt. The binarised features aren't in there yet. Should be fixable by copying the gb20.txt from the regular repo over to the ELDP one.
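As a rough illustration of how such filtering produces a smaller table, here is a minimal sketch that tallies values twice: once over every row, and once over only the rows whose feature is in a known set. It is not pygrambank's implementation. The sample rows, the feature IDs, the `Feature_ID` column name, the `KNOWN_FEATURES` set and the `tally` helper are all hypothetical; only the `Value` column name comes from the R call above.

```python
import csv
import io
from collections import Counter

# Hypothetical sheet excerpt; feature IDs and values are made up for illustration.
SHEET_TSV = (
    "Feature_ID\tValue\n"
    "GB020\t1\n"
    "GB021\t0\n"
    "GB024a\t1\n"
    "GB024b\t0\n"
    "GB999\t?\n"
)

# Hypothetical stand-in for the set of feature IDs a tool treats as known.
KNOWN_FEATURES = {"GB020", "GB021", "GB999"}


def tally(sheet_text, known):
    """Count values over all rows and over rows with a known feature only."""
    all_values = Counter()
    valid_values = Counter()
    unknown_features = []
    for row in csv.DictReader(io.StringIO(sheet_text), delimiter="\t"):
        all_values[row["Value"]] += 1
        if row["Feature_ID"] in known:
            valid_values[row["Value"]] += 1
        else:
            unknown_features.append(row["Feature_ID"])
    return all_values, valid_values, unknown_features


all_values, valid_values, unknown_features = tally(SHEET_TSV, KNOWN_FEATURES)
print(dict(all_values))       # what a plain table of the Value column would show
print(dict(valid_values))     # what a report restricted to known features would show
print(unknown_features)       # the features responsible for the difference
```

Listing the unknown features next to the two tallies makes it easy to see which rows account for the gap.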

Also, I wanted to check a sheet in regular grambank and picked HC_swed1254 at random. It said the following features are not valid:

GB029   GB063   GB101   GB154   GB181   GB208   GB220   GB232   GB245   GB269   GB287
GB032   GB064   GB102   GB157   GB182   GB209   GB221   GB234   GB246   GB271   GB288
GB033   GB066   GB106   GB162   GB183   GB210   GB222   GB235   GB247   GB272   GB289
GB034   GB067   GB112   GB163   GB191   GB211   GB223   GB236   GB248   GB274   GB290
GB040   GB076   GB125   GB168   GB194   GB212   GB224   GB237   GB249   GB277   GB292
GB045   GB077   GB128   GB169   GB195   GB213   GB225   GB238   GB251   GB278   GB293
GB050   GB078   GB141   GB173   GB199   GB214   GB226   GB239   GB255   GB279   GB294
GB055   GB085   GB142   GB174   GB200   GB215   GB227   GB240   GB258   GB280   GB295
GB056   GB087   GB143   GB175   GB202   GB216   GB228   GB241   GB259   GB281
GB060   GB088   GB144   GB176   GB205   GB217   GB229   GB242   GB261   GB282
GB061   GB097   GB145   GB178   GB206   GB218   GB230   GB243   GB267   GB283
GB062   GB100   GB153   GB179   GB207   GB219   GB231   GB244   GB268   GB284

Just to double-check: Anything surprising in this list?

@HedvigS
Collaborator Author

HedvigS commented May 26, 2024

> First of all: that table only shows the number of valid datapoints. If a datapoint has an error in it or describes an unknown feature, it won't show up in the table. That's where the discrepancy comes from.

Thanks!

> Second: Whether or not a feature is ‘known’ seems to depend on gb20.txt. The binarised features aren't in there yet. Should be fixable by copying the gb20.txt from the regular repo over to the ELDP one.

The binarised features are in gb20.txt though: https://github.com/glottobank/Grambank/blob/4c280aec01e30fa8bf53ad622b3b8e51699b1649/gb20.txt#L4620

> Also, I wanted to check a sheet in regular grambank and picked HC_swed1254 at random. It said the following features are not valid:
>
> GB029   GB063   GB101   GB154   GB181   GB208   GB220   GB232   GB245   GB269   GB287
> GB032   GB064   GB102   GB157   GB182   GB209   GB221   GB234   GB246   GB271   GB288
> GB033   GB066   GB106   GB162   GB183   GB210   GB222   GB235   GB247   GB272   GB289
> GB034   GB067   GB112   GB163   GB191   GB211   GB223   GB236   GB248   GB274   GB290
> GB040   GB076   GB125   GB168   GB194   GB212   GB224   GB237   GB249   GB277   GB292
> GB045   GB077   GB128   GB169   GB195   GB213   GB225   GB238   GB251   GB278   GB293
> GB050   GB078   GB141   GB173   GB199   GB214   GB226   GB239   GB255   GB279   GB294
> GB055   GB085   GB142   GB174   GB200   GB215   GB227   GB240   GB258   GB280   GB295
> GB056   GB087   GB143   GB175   GB202   GB216   GB228   GB241   GB259   GB281
> GB060   GB088   GB144   GB176   GB205   GB217   GB229   GB242   GB261   GB282
> GB061   GB097   GB145   GB178   GB206   GB218   GB230   GB243   GB267   GB283
> GB062   GB100   GB153   GB179   GB207   GB219   GB231   GB244   GB268   GB284
>
> Just to double-check: Anything surprising in this list?

Yeah, a lot. I'll check :)!
