Sharing distinct schema types where possible #60

Open
RickMoynihan opened this issue Dec 1, 2017 · 0 comments

RickMoynihan commented Dec 1, 2017

Dimensions and other types aren't shared, leading to a proliferation of schema types, e.g. for gender:

[Screenshot 2017-12-01 at 16.24.32]

Whilst for gender it's not a huge problem, for areas/time periods etc. it will lead to lots of highly duplicated data in the schema, i.e. millions of items.

If we could share the "distinct schema types" across datasets then things would be manageable, and orders of magnitude smaller.

"distinct schema types" here means "distinct dim/dim-val sets", i.e. I think on scotland there would currently only need to be three distinct gender dimension sets:

It's worth noting that the job of identifying distinct codelists would be easier if we could pass the buck and model the data that way in the first place. For example, scotland currently duplicates codelists per dataset, e.g. this codelist is unique, but could be reused by most of the gender datasets on scotland (assuming the data management practices did the right thing).

Managing codelists as distinct value sets would also make identifying comparable datasets easier, as they would literally re-use the same URI, but at the expense of extra complexity in handling dataset changes.
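As a rough sketch of what "distinct dim/dim-val sets" could mean in practice: key each dimension on the set of values it uses, so datasets with the same set land on the same shared type. The dataset names, the `gender` key and the value names below are made up for illustration, and `distinct_types` is a hypothetical helper, not something that exists in the codebase.

```python
from collections import defaultdict

# Hypothetical input: dataset name -> {dimension -> set of dimension values}.
# In real data these would be URIs; short names are used here for readability.
datasets = {
    "dataset-a": {"gender": {"male", "female", "all"}},
    "dataset-b": {"gender": {"female", "male", "all"}},
    "dataset-c": {"gender": {"male", "female"}},
}


def distinct_types(datasets):
    """Group datasets by distinct (dimension, value-set) pair.

    Each key in the result is one candidate shared schema type; the value
    is the list of datasets that could reuse it.
    """
    shared = defaultdict(list)
    for name, dims in sorted(datasets.items()):
        for dim, values in dims.items():
            # frozenset makes the value set hashable and order-independent.
            shared[(dim, frozenset(values))].append(name)
    return dict(shared)


types = distinct_types(datasets)
for (dim, values), users in types.items():
    print(dim, sorted(values), "shared by", users)
```

Datasets listed under the same key are also the ones that would be directly comparable on that dimension, which is the same thing re-using one codelist URI would give us for free.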

For areas the savings would obviously be much greater: on scotland, stats are published with full coverage every time, so it could be quite easily managed. For other areas with a more ad hoc approach to coverage we'd need more intelligence in the data management to avoid duplicated types, though I suspect duplicating at the small scale, e.g. within Trafford / GM, is not a problem as there will be so much less data.

If we managed codelists in this way we could solve #55 more easily, as having ~7000 datazones within a single enum isn't a major problem, but having 300 datasets * ~7000 datazones is.
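For scale, a back-of-envelope calculation using the approximate figures above (both numbers are rough, and this assumes every dataset carries the full set of datazones):

```python
datazones = 7000      # approximate number of datazones (~7000)
dataset_count = 300   # approximate number of datasets

# One enum per dataset: every dataset repeats all the datazone values.
per_dataset_total = dataset_count * datazones

# One shared enum: the values appear once, whatever the number of datasets.
shared_total = datazones

print(per_dataset_total)                  # 2100000
print(per_dataset_total // shared_total)  # 300
```

So sharing takes the datazone enum values from roughly 2.1 million down to roughly 7000.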

@RickMoynihan RickMoynihan changed the title Shared schema types Sharing schema types where possible Dec 1, 2017
@RickMoynihan RickMoynihan changed the title Sharing schema types where possible Sharing distinct schema types where possible Dec 1, 2017