Sharing distinct schema types where possible
Dimensions and other types aren't shared, leading to a proliferation of schema types, e.g. for gender:
Whilst for gender it's not a huge problem, for areas, time periods, etc. it will lead to lots of highly duplicated data in the schema, i.e. millions of items.
If we could share the "distinct schema types" across datasets then things would be manageable, and orders of magnitude smaller.
"distinct schema types" here means "distinct dim/dim-val sets", i.e. I think on scotland there would currently only need to be three distinct gender dimension sets:
#{all male female} (e.g. http://statistics.gov.scot/data/reconvictions)
#{all male female unknown} (e.g. http://statistics.gov.scot/data/child-benefit)
#{male female} (e.g. http://statistics.gov.scot/data/life-expectancy)

It's worth noting that the job of identifying distinct codelists would be easier if we could pass the buck and model the data that way in the first place. For example, scotland currently duplicates codelists per dataset: e.g. this codelist is unique, but could be reused by most of the gender datasets on scotland (assuming the data management practices did the right thing).
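As a rough illustration of what keying codelists off their distinct value sets could look like, here is a minimal Python sketch; the function, dataset names and code values are made up for the example and aren't taken from the scotland site's actual data:

```python
# Sketch: group datasets by the distinct set of codes they use for a
# dimension, so each distinct value set becomes one shared codelist.
# Dataset names and codes below are illustrative only.

def distinct_codelists(dataset_codes):
    """Map each distinct (frozen) value set to the datasets that use it."""
    shared = {}
    for dataset, codes in dataset_codes.items():
        shared.setdefault(frozenset(codes), []).append(dataset)
    return shared

example = {
    "reconvictions":   {"all", "male", "female"},
    "child-benefit":   {"all", "male", "female", "unknown"},
    "life-expectancy": {"male", "female"},
    "smoking":         {"all", "male", "female"},  # reuses the first set
}

for codes, datasets in distinct_codelists(example).items():
    print(sorted(codes), "->", sorted(datasets))
# Three distinct gender value sets cover all four datasets, so only three
# codelist resources would need to exist in the schema.
```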
Managing code lists as distinct value sets would also make identifying comparable datasets easier, as they would literally re-use the same URI - but at the expense of extra complexity in handling dataset changes.
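To illustrate the "same URI" point, here is a rough sketch of the kind of query shared codelists would enable, assuming the standard RDF Data Cube (qb:) modelling in which a coded dimension property carries a qb:codeList; the filename and the overall query shape are assumptions for the example, not the site's actual structure:

```python
from rdflib import Graph

# Hypothetical dump of dataset structure definitions (assumed filename).
g = Graph()
g.parse("datasets.ttl", format="turtle")

# If codelists were shared, datasets comparable on a dimension would point
# at literally the same codelist URI, so grouping by that URI finds them.
query = """
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT ?codelist (GROUP_CONCAT(STR(?dataset); SEPARATOR=", ") AS ?datasets)
WHERE {
  ?dataset a qb:DataSet ;
           qb:structure ?dsd .
  ?dsd     qb:component ?comp .
  ?comp    qb:dimension ?dim .
  ?dim     qb:codeList  ?codelist .
}
GROUP BY ?codelist
"""
for row in g.query(query):
    print(row.codelist, "->", row.datasets)
```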
For areas the savings would obviously be much greater, as on scotland stats are published with full coverage every time, so it could be quite easily managed. For other areas with a more ad hoc approach to coverage we'd need more intelligence in the data management to avoid duplicated types; though I suspect duplicating at the small scale, e.g. within Trafford / GM, is not a problem as there will be so much less data.
If we managed codelists in this way we could solve #55 more easily, as having ~7000 datazones within a single enum isn't a major problem; but having 300 datasets * ~7000 (roughly 2.1 million entries) is.