Cannot get dimension values for reference area and period #55

Closed
zeginis opened this issue Nov 24, 2017 · 11 comments

@zeginis
Contributor

zeginis commented Nov 24, 2017

The result contains an empty list for those two dimensions

@zeginis
Contributor Author

zeginis commented Nov 28, 2017

For example:

{
  dataset_births {
    dimensions {
      uri
      values {
        enum_name
        label
        uri
      }
    }
  }
}

@RickMoynihan
Member

This is problematic because dimension values are enums and some datasets contain many thousands of areas etc... So there are problems making this scale for the RootSchema as it is multiplied by each dataset.

@RickMoynihan
Member

Ok, we've been discussing this some more, and I thought I'd write up a few notes here, with a bit more detail than we went into on the call.

Basically the current implementation is correct in this behaviour because of a sensible compromise. Because there are many tens of thousands of areas, we represent areas as a String type, so there is no enum_name for them; the same will be true for reference period. It's not that useful here, but it is in one sense correct (though it should contain the URIs at least).

It's also worth noting that our current approach to schema generation is that types are generated from the DSD for each dataset; so you'll see that for genders on scotland every dataset with a gender dimension has its own graphql schema type for it:

[screenshot: screen shot 2017-12-01 at 16 24 32]

This is worth highlighting as it means that if we were to generate enum types for areas with the existing model the schema would explode to an almost unusable size, as every dataset would have its own copy of the areas.

I've opened a new issue #60 about sharing distinct schema types across datasets to look into this.
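
To make the size argument concrete, here is a minimal Python sketch comparing one enum type per dataset with one shared type per distinct set of values. It is not CubiQL's generator: the functions, the `deaths` and `population` dataset names, and the shared type naming are all hypothetical. Only the `dataset_<name>_<dimension>_type` pattern follows the type name visible in this thread.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

# Hypothetical input shape: dataset name -> dimension name -> enum values.
Datasets = Dict[str, Dict[str, FrozenSet[str]]]


def per_dataset_types(datasets: Datasets) -> Dict[str, FrozenSet[str]]:
    """One enum type per (dataset, dimension) pair."""
    return {
        f"dataset_{ds}_{dim}_type": values
        for ds, dims in datasets.items()
        for dim, values in dims.items()
    }


def shared_types(datasets: Datasets) -> Dict[Tuple[str, FrozenSet[str]], str]:
    """One enum type per distinct (dimension, value set), reused across datasets."""
    types: Dict[Tuple[str, FrozenSet[str]], str] = {}
    for dims in datasets.values():
        for dim, values in dims.items():
            # Only the first dataset that uses a value set creates a type.
            types.setdefault((dim, values), f"{dim}_type_{len(types)}")
    return types


def total_enum_values(value_sets: Iterable[FrozenSet[str]]) -> int:
    """Number of enum values the schema has to carry."""
    return sum(len(values) for values in value_sets)


gender = frozenset({"MALE", "FEMALE", "ALL"})
datasets: Datasets = {
    "births": {"gender": gender},
    "deaths": {"gender": gender},       # hypothetical dataset
    "population": {"gender": gender},   # hypothetical dataset
}

print(total_enum_values(per_dataset_types(datasets).values()))
print(total_enum_values(key[1] for key in shared_types(datasets)))
```

With per-dataset types the number of enum values grows with the number of datasets; with shared types it grows only with the number of distinct codelists, which is what matters once a codelist holds tens of thousands of areas.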

@BillSwirrl
Member

I see the point that having an area enum per dataset is impractical - sharing codelists across datasets as per issue 60 sounds promising.

I don't really understand your comment above about areas being a 'string type' - i.e. I'm not sure what that means to a user in terms of (a) discovering the possible values and (b) selecting observations by fixing the refArea dimension to a particular value (or list of possible values). Could you elaborate on that?

Or is that question not relevant if we can solve issue 60 re codelist/enum re-use and treat refArea as an enum?

@RickMoynihan
Member

RickMoynihan commented Dec 4, 2017

> I don't really understand your comment above about areas being a 'string type' - i.e. I'm not sure what that means to a user in terms of (a) discovering the possible values and (b) selecting observations by fixing the refArea dimension to a particular value (or list of possible values). Could you elaborate on that?

Sure, if the type of something is String there's essentially an infinite number of values it could be, so we can't really enumerate the options in the schema (well, we could let you query for all strings separately I suppose, but the information would be communicated outside of the schema; so probably workable but not ideal).

If the type is an Enum, all the possible values are enumerated in the schema / type information.

However, your question has made me realise that we have our wires somewhat crossed. GraphQL has its own reflective capabilities via __schema queries, and these datasets queries are in a sense rebuilding those, but with a slightly more specialised interface. So my comments have been targeted more at the underlying issue of what happens in the schema rather than at the datasets query directly.

To explain a bit more, a query like this graphql __schema query with some clientside filtering/postprocessing can also be used to get the dimension values for a dataset. If you search in the response to that query you'll find sections like this:

            {
              "name": "gender",
              "description": "Gender",
              "type": {
                "name": "dataset_births_gender_type",
                "kind": "ENUM",
                "enumValues": [
                  {
                    "description": null,
                    "name": "MALE"
                  },
                  {
                    "description": null,
                    "name": "FEMALE"
                  },
                  {
                    "description": null,
                    "name": "ALL"
                  }
                ],
                "description": null
              }
            }

The datasets query is obviously a lot simpler than this query; it produces less noise in the results and allows filtering by domain/qb concepts. However, the problems exist closer to this level and not at the level of Dimitris's query posted above, though the issues are visible there.
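
As a rough illustration of the client-side filtering mentioned above, the following Python sketch walks a decoded __schema response and collects the enum values of every field whose enum type name starts with a given prefix. The function name and the wrapping `{"fields": [...]}` structure are assumptions made for the example; only the fragment itself mirrors the response section quoted above.

```python
from typing import Any, Dict, List


def enum_dimensions(node: Any, prefix: str) -> Dict[str, List[str]]:
    """Collect {field name: [enum value names]} from a decoded __schema response.

    Only fields whose type is an ENUM with a name starting with `prefix`
    (e.g. "dataset_births_") are kept.
    """
    found: Dict[str, List[str]] = {}
    if isinstance(node, dict):
        type_info = node.get("type")
        if (
            isinstance(type_info, dict)
            and type_info.get("kind") == "ENUM"
            and (type_info.get("name") or "").startswith(prefix)
            and node.get("name") is not None
        ):
            values = type_info.get("enumValues") or []
            found[node["name"]] = [value["name"] for value in values]
        # Keep walking: fields are nested at arbitrary depth in the response.
        for child in node.values():
            found.update(enum_dimensions(child, prefix))
    elif isinstance(node, list):
        for child in node:
            found.update(enum_dimensions(child, prefix))
    return found


# The fragment quoted above, as a decoded Python value.
fragment = {
    "name": "gender",
    "description": "Gender",
    "type": {
        "name": "dataset_births_gender_type",
        "kind": "ENUM",
        "enumValues": [
            {"description": None, "name": "MALE"},
            {"description": None, "name": "FEMALE"},
            {"description": None, "name": "ALL"},
        ],
        "description": None,
    },
}

print(enum_dimensions({"fields": [fragment]}, "dataset_births_"))
# prints {'gender': ['MALE', 'FEMALE', 'ALL']}
```

A dimension typed as String never matches the `ENUM` check, which is why reference area and reference period come back with no discoverable values.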

So to answer (a), discovering possible values is currently only supported for enum types; and Enums are semantically the correct type to map to dimension values in a cube. So I think, for areas at least, solving #60 would let us represent them as enums, which would solve the problem here.

For refPeriods we could do the same, but we would need a good way to turn them into valid GraphQL enum syntax. I think part of #40 would need to be defining something like a :graphQLEnum predicate that we can attach to dimension values, along with whatever else we need.
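
For a sense of what "valid GraphQL enum syntax" demands, here is a minimal Python sketch of a fallback that derives a name from a label when no explicit predicate is supplied. The function `to_enum_name` and its normalisation rules are assumptions for illustration, not what CubiQL does. The one hard constraint it relies on is that a GraphQL name must match `[_A-Za-z][_0-9A-Za-z]*`, so a period label such as `2015` cannot be used as-is.

```python
import re

# Grammar of a GraphQL name; enum values must match it.
GRAPHQL_NAME = re.compile(r"[_A-Za-z][_0-9A-Za-z]*")


def to_enum_name(label: str) -> str:
    """Derive a syntactically valid GraphQL enum value from an arbitrary label."""
    # Collapse every run of characters outside [0-9A-Za-z] into one underscore.
    name = re.sub(r"[^0-9A-Za-z]+", "_", label).strip("_").upper()
    # A name may not be empty or start with a digit, so prefix an underscore.
    if not name or name[0].isdigit():
        name = "_" + name
    return name


for label in ["2015", "2015-Q1", "Aberdeen City"]:
    name = to_enum_name(label)
    assert GRAPHQL_NAME.fullmatch(name)
    print(label, "->", name)
```

Upper-casing also keeps the result clear of the reserved enum values `true`, `false` and `null`. The mapping is lossy, though: `2015-Q1` and `2015 Q1` produce the same name, so a real implementation would still need collision handling, or an explicit predicate on the dimension value as suggested above.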

So I think solving #60 and #40 will effectively let us solve this. Does that make sense and answer your question @BillSwirrl?

@RickMoynihan
Member

RickMoynihan commented Dec 4, 2017

I should also point out that for (b), building a query / fixing dimensions, having the dimension values in the schema as enums helps developers a lot, e.g.

[screenshot: screen shot 2017-12-04 at 08 47 09]

If the type of a dimension value is just a String, the developer would need to know what to enter and how to encode it; you can't just pick a value (e.g. you won't get completion for reference_area because it's a String). The work @lkitching is exploring on making refPeriods expressible as criteria may also help guide developers here by letting them specify ranges etc...

However in answering (b) there's also the question of helping tools build queries dynamically, and that is the purpose of the datasets queries (and also reflection via __schema).

See here for an example query that demonstrates this.

@RickMoynihan
Member

RickMoynihan commented Dec 8, 2017

Ok proposal for this is that we support querying both styles of dimension values like this:

{
  dataset_births {
    dimensions {
      ... on Dimension {
        uri
        values {
          uri
          label
        }
      }
      ... on EnumDimension {
        enum_name
        values {
          enum_name
        }
      }
    }
  }
}

With types/interfaces looking something like this (includes basic ideas for refArea):

interface Resource {
  uri: ID!
  label: String!
}

interface Dimension {
  uri: ID!
  label: String!
}

interface DimensionValue {
  uri: ID!
  label: String!
}

type DefaultDimension implements Dimension {
  uri: ID!
  label: String!
  values: [DimensionValue]  
}

type DefaultDimensionValue implements DimensionValue {
  uri: ID!
  label: String!
}

type EnumDimensionValue implements DimensionValue {
  uri: ID!
  label: String!
  enum_name: String!
}

type EnumDimension implements Dimension {
  uri: ID!
  label: String!
  values: [EnumDimensionValue]
}

type HierarchicalValue implements DimensionValue { # i.e. could be a RefAreaValue
  uri: ID!
  label: String!
  children: [DimensionValue] # NOTE you can't have recursive datatypes in graphql :-( but we could potentially improve later by generating more specific types for each area level etc...
}

type HierarchicalDimension implements Dimension { # i.e. could be a RefAreaDimension
  uri: ID!
  label: String!
  values: [HierarchicalValue]
}

NOTE: for this part of the schema, the basic "out of the box" Dimension is really just a Resource type; types should then be unionable with GraphQL in a manner similar to what I proposed for schema generation in issue #40.
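
From a client's point of view, supporting both styles means treating `enum_name` as optional. The Python sketch below shows one way a consumer might normalise a response to the query proposed above. The response shape and the `http://example.org/...` URIs are assumptions for illustration, since the thread does not show an actual response.

```python
from typing import Any, Dict, List, Optional

Value = Dict[str, Optional[str]]


def dimension_values(response: Dict[str, Any], dataset: str) -> Dict[str, List[Value]]:
    """Map each dimension URI to its values; enum_name is None when absent."""
    result: Dict[str, List[Value]] = {}
    for dim in response["data"][dataset]["dimensions"]:
        result[dim["uri"]] = [
            {
                "uri": value.get("uri"),
                "label": value.get("label"),
                # Only enum-style values carry an enum_name.
                "enum_name": value.get("enum_name"),
            }
            for value in dim.get("values") or []
        ]
    return result


# Hypothetical response: one enum-style and one plain dimension.
response = {
    "data": {
        "dataset_births": {
            "dimensions": [
                {
                    "uri": "http://example.org/dimension/gender",
                    "values": [
                        {
                            "uri": "http://example.org/gender/male",
                            "label": "Male",
                            "enum_name": "MALE",
                        }
                    ],
                },
                {
                    "uri": "http://example.org/dimension/refArea",
                    "values": [
                        {"uri": "http://example.org/area/1", "label": "Area 1"}
                    ],
                },
            ]
        }
    }
}

values = dimension_values(response, "dataset_births")
print(values["http://example.org/dimension/refArea"])
```

A tool that builds queries dynamically can then pass `enum_name` where it is set and fall back to the URI otherwise.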

@zeginis
Contributor Author

zeginis commented Dec 8, 2017

@RickMoynihan this looks good.

Just a question. Why should we hardcode the RefAreaValue and RefAreaDimension ?

The only difference I see is children: [DimensionValue] that is also required for every dimension that has hierarchical data e.g. also for refPeriod.

I think it is better to generalize this to HierarchicalValue and HierarchicalDimension.

@RickMoynihan
Member

RickMoynihan commented Dec 15, 2017

Agree on the generalisation aspect, I had the same thoughts when writing the example, but chose to describe it concretely to try and make it clearer. Will edit snippet above & rename them though to what you suggest.

I should also say I think my proposal is still pretty minimal in functionality, and I think the limitations of the model above for HierarchicalDimensions and the lack of recursive datatypes might not be good enough for what we actually want.

I think solving these problems essentially involves us abstracting over Types themselves with a CubiQL notion of Kinds, i.e. CubiQL would recursively create all the types necessary to represent each level of hierarchy in the GraphQL schema, essentially working around GraphQL's lack of recursive data types. So in CubiQL, HierarchicalDimension would effectively be like a Kind in type theory, which we'd expand out into GraphQL types: one Type for each "refAreaLevel" in the hierarchy, to avoid the infinite recursion. As an illustration in terms of Scotland's data, this would mean generating a DZ type at the leaves, with an IZ type above it etc... There's some overlap between this and some WIP we're doing at Swirrl on defining a vocabulary for describing these levels, though I'm not sure how stable that is either? @BillSwirrl @RicSwirrl.
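
The expansion described above could be sketched roughly as follows in Python: given an ordered list of levels, emit one GraphQL object type per level, each pointing at the type for the next level down. The function `expand_levels`, the `<dimension>_<level>_value` naming and the level identifiers are hypothetical; this is an illustration of the idea, not a proposed implementation.

```python
from typing import List


def expand_levels(dimension: str, levels: List[str]) -> str:
    """Emit one GraphQL object type per hierarchy level, root level first.

    Each type's `children` field refers to the type of the next level down.
    The leaf type has no `children` field, so the chain of types is finite.
    """
    blocks = []
    for i, level in enumerate(levels):
        fields = ["  uri: ID!", "  label: String!"]
        if i + 1 < len(levels):
            fields.append(f"  children: [{dimension}_{levels[i + 1]}_value]")
        body = "\n".join(fields)
        blocks.append(
            f"type {dimension}_{level}_value implements DimensionValue {{\n{body}\n}}"
        )
    return "\n\n".join(blocks)


# Hypothetical level names: council area, intermediate zone (IZ), data zone (DZ).
sdl = expand_levels("ref_area", ["council_area", "iz", "dz"])
print(sdl)
```

This assumes a single strict chain of levels, which is exactly the kind of limitation that would need more detailed spec work.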

We could probably get something like the schema in my comment above working in a week or two; but I think doing something more complete will require a lot more detailed spec work to figure out the limitations/vocabs/schemas we require.

@BillSwirrl
Member

What we have found in our 'features of interest' approach in PublishMyData is that a strict hierarchy for geographical data (or organisations etc) is too restrictive, because we might want different hierarchies for different datasets.

Also, even for a single tree-structure hierarchy, it can be convenient to jump levels in the tree. A common requirement is to get all the data zones in a council area, and it's useful to get those directly.

So we want

council area --> data zone

Not:

council area --> ward --> data zone
or
council area --> intermediate zone --> data zone

(note those are two different hierarchical relationships as intermediate zones don't nest inside wards in Scottish geography).

Also when we start mixing in data about hospitals or schools or job centres, we might want to know which hospitals are in a council area.

The approach we've taken in PublishMyData is that a feature of interest (area, organisation, etc) can be a memberOf one or more 'collections', and that the feature can be within various other features.

The 'within' relationship could be generalised to other kinds of relationships between items in codelists, for example a medical treatment might be 'offeredBy' a hospital.

I think this basic data model is generic enough that it could work with everyone's data, so it could be appropriate to use in CubiQL. But we'll need to define and document the specific triples we expect, and data publishers will have to augment their codelists with the collection and relationship data.
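
To illustrate the shape of this model, here is a minimal in-memory Python sketch in which each feature lists the collections it is a member of and every feature it is within, so levels can be skipped without walking a tree. The class, the function, the collection names and the short identifiers are all hypothetical and do not reflect the PublishMyData vocabulary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Feature:
    uri: str
    collections: Set[str] = field(default_factory=set)  # memberOf
    within: Set[str] = field(default_factory=set)       # containing features


def features_within(
    features: Dict[str, Feature], container: str, collection: str
) -> List[str]:
    """URIs of features in `collection` declared as within `container`."""
    return sorted(
        f.uri
        for f in features.values()
        if collection in f.collections and container in f.within
    )


# Hypothetical data: the data zone is within both its intermediate zone and
# its council area, so the council area lookup needs no traversal.
features = {
    f.uri: f
    for f in [
        Feature("ca/1", {"council-areas"}),
        Feature("iz/1", {"intermediate-zones"}, {"ca/1"}),
        Feature("dz/1", {"data-zones"}, {"iz/1", "ca/1"}),
        Feature("hospital/1", {"hospitals"}, {"ca/1"}),
    ]
}

print(features_within(features, "ca/1", "data-zones"))
print(features_within(features, "ca/1", "hospitals"))
```

The cost of this design is that publishers must assert every `within` relationship they want to be queryable, which matches the point above about augmenting codelists with collection and relationship data.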

@RickMoynihan
Member

The main motivation for this has been fixed; but we should consider further refinements as part of a new issue #81.
