Sharing distinct schema types where possible
Dimensions and other types aren't shared, leading to a proliferation of schema types, e.g. for gender:
Whilst for gender it's not a huge problem, for areas, time periods, etc. it will lead to lots of highly duplicated data in the schema, i.e. millions of items.
If we could share the "distinct schema types" across datasets then things would be manageable, and orders of magnitude smaller.
"distinct schema types" here means "distinct dim/dim-val sets", i.e. I think on scotland there would currently only need to be three distinct gender dimension sets:
#{all male female} (e.g. http://statistics.gov.scot/data/reconvictions)
#{all male female unknown} (e.g. http://statistics.gov.scot/data/child-benefit)
#{male female} (e.g. http://statistics.gov.scot/data/life-expectancy)

It's worth noting that the job of identifying distinct codelists would be easier if we could pass the buck and model the data that way in the first place. For example, scotland currently duplicates codelists per dataset: e.g. this codelist is unique, but could be reused by most of the gender datasets on scotland (assuming the data management practices did the right thing).
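As a rough illustration of what keying codelists off their distinct value sets could look like, here is a minimal Python sketch; the function, dataset names and code values are made up for the example and aren't taken from the scotland site's actual data:

```python
# Sketch: group datasets by the distinct set of codes they use for a
# dimension, so each distinct value set becomes one shared codelist.
# Dataset names and codes below are illustrative only.

def distinct_codelists(dataset_codes):
    """Map each distinct (frozen) value set to the datasets that use it."""
    shared = {}
    for dataset, codes in dataset_codes.items():
        shared.setdefault(frozenset(codes), []).append(dataset)
    return shared

example = {
    "reconvictions":   {"all", "male", "female"},
    "child-benefit":   {"all", "male", "female", "unknown"},
    "life-expectancy": {"male", "female"},
    "smoking":         {"all", "male", "female"},  # reuses the first set
}

for codes, datasets in distinct_codelists(example).items():
    print(sorted(codes), "->", sorted(datasets))
# Three distinct gender value sets cover all four datasets, so only three
# codelist resources would need to exist in the schema.
```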
Managing code lists as distinct value sets would also make identifying comparable datasets easier, as they would literally re-use the same URI - but at the expense of extra complexity in handling dataset changes.
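To illustrate the "same URI" point, here is a rough sketch of the kind of query shared codelists would enable, assuming the standard RDF Data Cube (qb:) modelling in which a coded dimension property carries a qb:codeList; the filename and the overall query shape are assumptions for the example, not the site's actual structure:

```python
from rdflib import Graph

# Hypothetical dump of dataset structure definitions (assumed filename).
g = Graph()
g.parse("datasets.ttl", format="turtle")

# If codelists were shared, datasets comparable on a dimension would point
# at literally the same codelist URI, so grouping by that URI finds them.
query = """
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT ?codelist (GROUP_CONCAT(STR(?dataset); SEPARATOR=", ") AS ?datasets)
WHERE {
  ?dataset a qb:DataSet ;
           qb:structure ?dsd .
  ?dsd     qb:component ?comp .
  ?comp    qb:dimension ?dim .
  ?dim     qb:codeList  ?codelist .
}
GROUP BY ?codelist
"""
for row in g.query(query):
    print(row.codelist, "->", row.datasets)
```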
For areas the savings would obviously be much greater, as on scotland stats are published with full coverage every time, so it could be quite easily managed. For other areas with a more ad hoc approach to coverage we'd need more intelligence in the data management to avoid duplicated types; though I suspect duplicating at the small scale, e.g. within Trafford / GM, is not a problem as there will be so much less data.
If we managed codelists in this way we could solve #55 more easily, as having ~7000 datazones within a single enum isn't a major problem; but having 300 datasets * ~7000 (roughly 2.1 million entries) is.