DDF validator didn’t find duplicated entities in a set #197

Closed
buchslava opened this issue Aug 9, 2016 · 11 comments

Comments

@buchslava
Collaborator

Entity IDs _can_ be duplicated across sets, for example male as a gender and male as a country, but never within a set. If they are, the set is broken.

@angiehjort
Member

test file: in this entity list the validator should discover a duplicate
ddf--entities--tags.csv.zip

@jheeffer
Member

jheeffer commented Aug 9, 2016

I don't completely agree.

Case 1. Angie's test file. Should warn if entities are identical and error if they are not (identical=properties are all the same).

Case 2. One entity defined in two files. E.g. geo swe is defined in ddf--entities--geo--country.csv and ddf--entities--geo--un_state.csv.
If there is no overlap in properties defined in the two files, no error or warning.
If there are properties which are defined in both files they must be identical, otherwise error. If there is no

In other words: if an entity is defined twice:
Warning if both definitions are identical, error if there is an inconsistency in definition, nothing if there is no inconsistency and definitions are not identical (one definition has properties that the other doesn't).
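
A minimal sketch of the three-way rule summarized above, assuming each definition is a plain dict of column name to value. The function name `classify_definitions` and the `key` parameter are hypothetical and are not taken from the validator's code.

```python
def classify_definitions(first: dict, second: dict, key: str = "geo") -> str:
    """Compare two definitions of the same entity.

    Returns "error" if a property defined in both has different values,
    "warning" if both definitions are fully identical,
    and "ok" if they agree where they overlap but are not identical.
    """
    # Properties present in both definitions, ignoring the entity key column.
    shared = (set(first) & set(second)) - {key}
    if any(first[prop] != second[prop] for prop in shared):
        return "error"
    if first == second:
        return "warning"
    return "ok"


country = {"geo": "swe", "name": "Sweden", "is--country": "1"}
un_state = {
    "geo": "swe",
    "name": "Sweden",
    "un_membership_year": "1946",
    "is--un_state": "1",
}

print(classify_definitions(country, un_state))       # prints "ok"
print(classify_definitions(country, dict(country)))  # prints "warning"
```

Checking for conflicts first means an inconsistency always wins over the "identical" warning.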

Does that make sense?

@angiehjort
Member

yes, makes perfect sense

@buchslava
Collaborator Author

@jheeffer one note: I created one file based version of this rule and ask @rychkog

  1. We should generate an error for any entity ID duplication within one file -> WS expected
  2. I don't understand the cases where this rule applies to more than one file. Please give me concrete examples with data.

Please ask @rychkog about this rule and its usage via WS.

@jheeffer
Member

jheeffer commented Sep 1, 2016

ddf--entities--geo--country.csv

geo  name    is--country
swe  Sweden  1

ddf--entities--geo--un_state.csv

geo  name    un_membership_year  is--un_state
swe  Sweden  1946                1

The above is valid but should give a warning because geo.name of swe is defined twice, but is equal so there's no error.
I am not sure about the warning here. On the one hand, it is good to try to limit the amount of redundant data in a dataset. Warnings could help with spotting these redundancies.
On the other hand, this could lead to maaaaany useless warnings. Maybe they should be per file: "geo.name is defined in both ddf--entities--geo--country.csv and ddf--entities--geo--un_state.csv and causes redundancy" instead of a per-entity warning. However, this is a more complex validation I think. It's possible for geo.name to be in multiple files without causing overlap/redundancy. Plus, duplicating geo.name over files could be useful for overview, so maybe a warning is not always in place? What do you think?


ddf--entities--geo--country.csv

geo  name    is--country
swe  Sweden  1

ddf--entities--geo--un_state.csv

geo  name               un_membership_year  is--un_state
swe  Kingdom of Sweden  1946                1

The above is invalid and should throw an error because geo.name for swe has two different values and thus there's a conflict for the value of geo.name for swe.
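
The cross-file check in the invalid example above could be sketched as follows. This is only an illustration: the file contents are written as comma-separated text held in a hypothetical `FILES` dict, and `find_conflicts` is an assumed helper name, not part of the validator.

```python
import csv
import io
from collections import defaultdict

# Hypothetical in-memory copies of the two files from the example above.
FILES = {
    "ddf--entities--geo--country.csv": "geo,name,is--country\nswe,Sweden,1\n",
    "ddf--entities--geo--un_state.csv": (
        "geo,name,un_membership_year,is--un_state\n"
        "swe,Kingdom of Sweden,1946,1\n"
    ),
}


def find_conflicts(files: dict, key: str = "geo") -> list:
    """Return (entity, property, {value: [file names]}) for every property
    that has more than one distinct value across the given files."""
    # seen[entity][property][value] -> list of files defining that value
    seen = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for file_name, text in files.items():
        for row in csv.DictReader(io.StringIO(text)):
            for prop, value in row.items():
                if prop != key:
                    seen[row[key]][prop][value].append(file_name)
    conflicts = []
    for entity, props in seen.items():
        for prop, values in props.items():
            if len(values) > 1:
                conflicts.append((entity, prop, dict(values)))
    return conflicts


for entity, prop, values in find_conflicts(FILES):
    print(f"error: {prop} of {entity} has conflicting values: {sorted(values)}")
```

Keeping the file names next to each value makes it possible to say which files disagree, which is what a per-file message would need.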

@jheeffer
Member

jheeffer commented Sep 1, 2016

Also I'm fine with error'ing on duplicate ID in one file, as under your first case.

ddf--entities--geo--country.csv

geo  name     is--country
swe  Sweden   1
ukr  Ukraine  1
swe  Sweden   1

Invalid because of duplicate swe entity in one file, even though properties are all the same.
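
A short sketch of the single-file check, assuming the file is comma-separated and its first column is the entity key. `duplicated_ids` and `COUNTRY_CSV` are hypothetical names used only for illustration.

```python
import csv
import io
from collections import Counter


def duplicated_ids(csv_text: str, key: str = "geo") -> list:
    """Return entity IDs that occur more than once in one file,
    in the order they were first seen."""
    counts = Counter(row[key] for row in csv.DictReader(io.StringIO(csv_text)))
    return [entity_id for entity_id, n in counts.items() if n > 1]


# The example file above, written as comma-separated text.
COUNTRY_CSV = "geo,name,is--country\nswe,Sweden,1\nukr,Ukraine,1\nswe,Sweden,1\n"

print(duplicated_ids(COUNTRY_CSV))  # prints ['swe']
```

The check looks only at the key column, so a repeated ID is reported even when all other properties match.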

@buchslava
Collaborator Author

buchslava commented Sep 1, 2016

@jheeffer This idea makes sense only if I analyze name as a hardcoded constant. Is that a good idea?

regarding warnings:

> However, this is a more complex validation I think. It's possible for geo.name to be in multiple files without causing overlap/redundancy. Plus, duplicating geo.name over files could be useful for overview, so maybe a warning is not always in place? What do you think?

I think it's not a problem. No need to produce a warning in this case.

@jheeffer
Member

jheeffer commented Sep 1, 2016

The above example is just that, an example. I just happen to use name in the example but it should work this way for any entity property, not just name.

I'm not sure what you mean by it not making sense if it's not hardcoded. Can you elaborate?

@buchslava
Collaborator Author

@jheeffer yes, I understand the idea: I'll take the intersection of the keys of the two records. For example,

for name is--country and name un_membership_year is--un_state the intersection will be name.

Then I'll compare the values of those fields (name): if they are equal, OK; otherwise, error.
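
The key-intersection approach described here could look roughly like this. It is a sketch under stated assumptions: the entity key column (`geo`) is excluded from the intersection, and `shared_properties` and `mismatches` are hypothetical helper names.

```python
def shared_properties(header_a: list, header_b: list, key: str = "geo") -> list:
    """Return the non-key columns present in both headers, in header_a order."""
    in_b = set(header_b)
    return [col for col in header_a if col != key and col in in_b]


def mismatches(record_a: dict, record_b: dict, key: str = "geo") -> dict:
    """Map each shared property to its pair of values when the two differ."""
    return {
        prop: (record_a[prop], record_b[prop])
        for prop in shared_properties(list(record_a), list(record_b), key)
        if record_a[prop] != record_b[prop]
    }


country = {"geo": "swe", "name": "Sweden", "is--country": "1"}
un_state = {
    "geo": "swe",
    "name": "Kingdom of Sweden",
    "un_membership_year": "1946",
    "is--un_state": "1",
}

print(shared_properties(list(country), list(un_state)))  # prints ['name']
print(mismatches(country, un_state))
```

Nothing here is specific to name: any property that appears in both records is compared in the same way.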

@jheeffer
Member

jheeffer commented Sep 1, 2016

Ok, good, but what is the problem with hardcoding you mentioned in your previous comment then?

also, if they are equal - "maybe warning" was what I wrote. What do you think? Ok or warning?

@buchslava
Collaborator Author

@jheeffer no problem, my bad, sorry
