
Multi lingual dataset support #6

Open
RickMoynihan opened this issue Sep 7, 2017 · 13 comments

@RickMoynihan
Member

RDF supports lang strings, and there's a possibility of multi-lingual datasets.

We may want to add support for this as part of OGI.

RickMoynihan changed the title from Multiple languages to Multiple lingual dataset support on Sep 7, 2017
RickMoynihan changed the title from Multiple lingual dataset support to Multi lingual dataset support on Sep 18, 2017
@zeginis
Contributor

zeginis commented Sep 18, 2017

I agree. At OGI there are multi-lingual datasets.

We may consider using JSON-LD (@language) to express the language used.

@RickMoynihan
Member Author

I'm no longer sure we can use JSON-LD, but I am curious about the requirements for multiple languages.

For example would a multilingual client want to list all labels in all languages? Or should it only ever get back a single requested (or default) language?

For example, you could imagine setting the language at the outermost field for the whole subtree:

{  
   datasets(language:"fr") { 
      title
      dimensions { 
         values {
           label 
         }
      }
   }
}

Obviously we could also let you query for what languages are currently in the system, e.g.

{
   languages { 
       country_code
   }
}

Other alternatives are to expand every string field into two sub-fields of lang and value, which seems pretty heavy-handed, or to generate fields in the schema for every language in the system, e.g. title_fr, title_en, title_gb.
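
For concreteness, those two alternatives might look roughly like this as a Lacinia-style schema fragment (the field and type names here are purely illustrative, not the actual cubiql schema):

;; Purely illustrative Lacinia-style schema fragment -- not the actual
;; cubiql schema. Shows both alternatives side by side.
(def multilingual-schema-sketch
  {:objects
   {;; Alternative 1: every string field becomes a lang/value object.
    :LangLabel {:fields {:lang  {:type 'String}
                         :value {:type 'String}}}
    :Dataset   {:fields {:title    {:type :LangLabel}
                         ;; Alternative 2: one generated field per language.
                         :title_en {:type 'String}
                         :title_fr {:type 'String}}}}})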

@zeginis
Contributor

zeginis commented Sep 18, 2017

I think a single requested or default language is enough. So something like datasets(language: "fr") { ... } is ok.

It is preferable to get the available languages for a specific dataset not for the whole system because different datasets may have different available languages. e.g.

{
   languages(dataset: "http://statistics.gov.scot/data/earnings") { 
       country_code
   }
}

@RickMoynihan
Member Author

It is preferable to get the available languages for a specific dataset not for the whole system because different datasets may have different available languages.

👍

@zeginis
Copy link
Contributor

zeginis commented Dec 7, 2017

Some issues related to the language:

  • Greek labels were not supported
  • language tags (e.g. @en) cause errors

@RickMoynihan
Member Author

RickMoynihan commented Dec 7, 2017

Specifically the current problem with language strings is that they cause exceptions during schema generation by failing the following spec (from issue #53):

In: [0 :objects :dataset_vehicles_cube 1 :description] val: #grafter.rdf.protocols.LangString{:string "Vehicles Cube", :lang :en} fails spec: :com.walmartlabs.lacinia.schema/description at: [:args :schema :objects 1 :description] predicate: string?

@zeginis I think it would be desirable to keep the graphql schema simple here and avoid having to represent multiple languages in the schema at this stage, i.e. we should avoid doing things like this for every label/title:

{ 
  title {
     title  # the real title string
     language
  }    
}

i.e. I think I'd rather keep the schema for labels flat like this:

{
   title
}

This will probably mean, in the case of multiple languages, setting a default to use everywhere throughout the API; we could potentially allow overriding that default at the top of the query.

@zeginis Does that sound like an acceptable compromise? The limitation is that within a single request you won't be able to see things like the title of a dataset in both English and Greek.

@zeginis
Contributor

zeginis commented Dec 7, 2017

It is ok to define the language at the top of the query and thus get results in only one language.

@RickMoynihan
Member Author

One other question @zeginis: would it be acceptable not to let you set this at the top of the query, but instead to supply it as a configuration option to the server itself, i.e. no schema representation at all?

@zeginis
Contributor

zeginis commented Jun 22, 2018

One other question @zeginis: would it be acceptable not to let you set this at the top of the query, but instead to supply it as a configuration option to the server itself, i.e. no schema representation at all?

@RickMoynihan this solution is not applicable at OGI, since we will have cubes from many pilots on the same server with labels in different languages, e.g. Greek and English.

So it is preferable to define the language at the top of the query. Any idea how to do this?

@RickMoynihan
Member Author

Any idea how to do this?

It's not currently supported; if you're asking about how I think it should be implemented, then I'd suggest:

  1. We should introduce a new root cubiql node to support various parameters such as this for all subtree schemas. The idea being that parameters set at the root affect those parts of the query within its lexical scope:

i.e. we would probably have to change it to do this, so lang_preference affects not just datasets but specific dataset schemas, and any others we add too:

{
   cubiql(lang_preference: "gr") {
      datasets {
          title
          description
      }
   }
}
  2. The lang_preference attribute should specify a language tag preference, not a hard constraint, i.e. if you have a dataset with a dcterms:title of "School"^^xsd:string and "σχολείο"@gr it should select the Greek title. If however the description only has :school-ds dcterms:description "Numbers of schools by area"^^xsd:string, then it should fall back and return the xsd:string.

In terms of implementation I don't think there is a good way to express this priority on labels in SPARQL in a performant and simple enough way. So I think the best way to implement this is to make sure we implement all these queries as CONSTRUCTs, and then apply the priority filtering to all returned data. Roughly, the algorithm would be to group the local graph of ?s ?p ?o by ?s ?p; then, for each ?s ?p where ?o has datatype xsd:string or rdf:langString, return only the ?o matching lang_preference, failing that an xsd:string, and failing that any other rdf:langString.
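
Roughly sketched in Clojure (with made-up helper names and a simplified triple representation, rather than the actual cubiql code), that prioritisation could look something like:

;; Rough sketch of the prioritisation described above -- not the cubiql
;; implementation. Assumes the CONSTRUCTed triples arrive as maps like
;; {:s <subject> :p <predicate> :o <object>}, where a plain Clojure string
;; stands in for an xsd:string literal and a map such as
;; {:string "σχολείο" :lang :gr} stands in for an rdf:langString.

(defn lang-string? [o]
  (and (map? o) (contains? o :lang)))

(defn pick-label
  "Given all ?o values for one ?s ?p pair, prefer the requested language,
  then a plain xsd:string, then any other language-tagged string."
  [lang-preference os]
  (or (first (filter #(and (lang-string? %) (= (:lang %) lang-preference)) os))
      (first (filter string? os))
      (first (filter lang-string? os))))

(defn prioritise-labels
  "Group triples by [?s ?p] and keep a single string value per pair."
  [lang-preference triples]
  (into {}
        (for [[sp bindings] (group-by (juxt :s :p) triples)]
          [sp (pick-label lang-preference (map :o bindings))])))

;; Example, using the dataset from the comment above:
;; (prioritise-labels :gr
;;   [{:s :school-ds :p :dcterms/title       :o "School"}
;;    {:s :school-ds :p :dcterms/title       :o {:string "σχολείο" :lang :gr}}
;;    {:s :school-ds :p :dcterms/description :o "Numbers of schools by area"}])
;; => {[:school-ds :dcterms/title]       {:string "σχολείο", :lang :gr}
;;     [:school-ds :dcterms/description] "Numbers of schools by area"}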

Is something like this what you were thinking of implementing?

@zeginis
Contributor

zeginis commented Jun 22, 2018

Yes, this is what I was thinking of implementing. I realize it is not as simple as I expected.

Do you think there is a way to temporarily overcome the exceptions (#88) caused by the language tags even if we do not fully support filtering by language?

@RickMoynihan
Member Author

RickMoynihan commented Jun 22, 2018

That's a good question @zeginis. I suspect it's a pretty trivial fix to make that specific error go away, as it's probably not much more than calling str on the language-tagged string before returning it.
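
As a minimal sketch of that quick fix (assuming a LangString-shaped record like the one in the spec error above, and pulling out its :string field directly rather than relying on what str prints for the record):

;; Minimal sketch of the quick fix: strip the language tag before the
;; value reaches the schema, leaving plain strings untouched. Assumes a
;; LangString-like record carrying :string and :lang fields, as in the
;; error quoted above; not the actual cubiql code.
(defn plain-string [v]
  (if (and (record? v) (contains? v :string) (contains? v :lang))
    (:string v)
    v))

;; Given the LangString shown above, (plain-string v) => "Vehicles Cube";
;; a plain string like "Vehicles Cube" is returned unchanged.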

However there's still the expectation that there's only ONE value for a lot of these fields. So this would likely only really work for string properties with a cardinality of 1, since to retain the schema you'll need to pick just one string; and then you're into the territory of the above suggestion.

I could be wrong, but I'm not sure this hacky solution is worth doing, because you either need to implement the prioritisation logic above, or return a random string (unacceptable, as datasets would render with mixed languages), or hack your data so you only ever have one string for these fields (either an rdf:langString or an xsd:string would work -- but not both, and not more than one of each, i.e. no multi-lingual datasets). My feeling is that if you have to hack your data to remove the strings you don't want, you might as well have just hacked your data to make them xsd:strings.

The only counter-argument I can see to this (in support of implementing the str hack) is that it does mean cubiql will support a more correct subset of a larger cube, i.e. it's marginally better to allow "σχολείο"@gr in preference to "σχολείο"^^xsd:string, as you're not downgrading information; you're just loading a subset into cubiql's endpoint.

Practically speaking though, I'm not sure this correctness argument holds much weight, as you'll still need to hack your data to guarantee it works... it's just that the hack is a tiny bit less hacky.

@zeginis
Contributor

zeginis commented Jul 12, 2018

@RickMoynihan any update on this?

Are you going to fix this, or should we go on with the "quick fix" option -> calling str on the language-tagged string before returning it?

lkitching added a commit that referenced this issue Aug 1, 2018
Issue #6 - Move all dataset queries under a new top-level cubiql query
used to set global parameters on the contained queries. Datasets are
now fields on the qb type returned from the cubiql query.