Order of categories influences chi_square statistic #2189

Open
ViacheslavP opened this issue Dec 2, 2024 · 1 comment

Steps to reproduce

  1. Create a simple dataset with a 2:1 ratio
data.sql

For some reason, I was unable to run Soda with fewer rows.

create table Employee (
                        id int primary key,
                        name varchar(255)
);

insert into Employee (id, name) values (1, 'Alice');
insert into Employee (id, name) values (2, 'Bob');
insert into Employee (id, name) values (3, 'Alice');

insert into Employee (id, name) values (11, 'Alice');
insert into Employee (id, name) values (12, 'Bob');
insert into Employee (id, name) values (13, 'Alice');

insert into Employee (id, name) values (21, 'Alice');
insert into Employee (id, name) values (22, 'Bob');
insert into Employee (id, name) values (23, 'Alice');

insert into Employee (id, name) values (31, 'Alice');
insert into Employee (id, name) values (32, 'Bob');
insert into Employee (id, name) values (33, 'Alice');

insert into Employee (id, name) values (41, 'Alice');
insert into Employee (id, name) values (42, 'Bob');
insert into Employee (id, name) values (43, 'Alice');

insert into Employee (id, name) values (51, 'Alice');
insert into Employee (id, name) values (52, 'Bob');
insert into Employee (id, name) values (53, 'Alice');
  2. Run the following check
checks for Employee:
  - row_count = 18

  - distribution_difference(name) < 0.05:
      method: chi_square
      distribution reference file: ./distribution.yaml

with distribution.yaml:

dataset: employee
column: name
distribution_type: categorical
distribution_reference:
  weights:
  - 0.7
  - 0.3
  bins:
  - Alice
  - Bob

Expected behavior

The chi_square statistic should be close to zero, since there are 12 Alice rows and 6 Bob rows, which is close to the 0.7/0.3 reference weights.

Actual behavior

The statistic value is high (~0.6).

Misc

When I swap the order of the weights but leave the bins unchanged, the statistic is OK.
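For comparison, here is a minimal standalone sketch of a Pearson chi-square computation on the counts above (12 Alice, 6 Bob), once with the weights paired to the bins as in distribution.yaml and once with the weights swapped. It does not use Soda and is not meant to reflect how Soda computes its value; the helper names `chi_square` and `p_value_two_bins` are hypothetical, and the p-value shortcut only holds for two categories (one degree of freedom).

```python
import math


def chi_square(observed, weights):
    """Pearson chi-square statistic of observed counts against reference weights."""
    total = sum(observed)
    # Expected count per bin if the data followed the reference weights exactly.
    expected = [total * w for w in weights]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def p_value_two_bins(statistic):
    """Survival function of the chi-square distribution with 1 degree of freedom.

    Valid only for two categories; for more bins a general chi-square CDF is needed.
    """
    return math.erfc(math.sqrt(statistic / 2))


observed = [12, 6]  # counts for bins [Alice, Bob]

matched = chi_square(observed, [0.7, 0.3])  # weights paired with bins as in distribution.yaml
swapped = chi_square(observed, [0.3, 0.7])  # weights paired with the opposite bins

print(f"matched: statistic={matched:.4f}, p-value={p_value_two_bins(matched):.4f}")
print(f"swapped: statistic={swapped:.4f}, p-value={p_value_two_bins(swapped):.4f}")
```

With the weights paired as in the file, the statistic comes out at about 0.095 (p-value roughly 0.76); with the pairing swapped it is about 11.52 (p-value below 0.001). Comparing these numbers with what Soda reports for each ordering may help show which weight ends up paired with which bin.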

@tools-soda
CLOUD-8980