
Multi lingual dataset support #6

Open
RickMoynihan opened this issue Sep 7, 2017 · 13 comments

@RickMoynihan
Member

RDF supports lang strings, and there's a possibility of multi-lingual datasets.

We may want to add support for this as part of OGI.

RickMoynihan changed the title from Multiple languages to Multiple lingual dataset support on Sep 7, 2017
RickMoynihan changed the title from Multiple lingual dataset support to Multi lingual dataset support on Sep 18, 2017
@zeginis
Contributor

zeginis commented Sep 18, 2017

I agree. At OGI there are multi-lingual datasets.

We may consider using JSON-LD (@language) to express the language used.

@RickMoynihan
Member Author

I'm no longer sure we can use JSON-LD, but I am curious about the requirements for multiple languages.

For example would a multilingual client want to list all labels in all languages? Or should it only ever get back a single requested (or default) language?

For example, you could imagine setting the language at the outermost field for the whole subtree:

{  
   datasets(language:"fr") { 
      title
      dimensions { 
         values {
           label 
         }
      }
   }
}

Obviously we could also let you query for what languages are currently in the system, e.g.

{
   languages { 
       country_code
   }
}

Other alternatives are to expand every string field into two sub-fields of lang and value, which seems pretty heavy-handed, or to generate fields in the schema for every language in the system, e.g. title_fr, title_en, title_gb.
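
For concreteness, those two alternatives might look roughly like this as a Lacinia-style schema fragment (the field and type names here are purely illustrative, not the actual cubiql schema):

;; Purely illustrative Lacinia-style schema fragment -- not the actual
;; cubiql schema. Shows both alternatives side by side.
(def multilingual-schema-sketch
  {:objects
   {;; Alternative 1: every string field becomes a lang/value object.
    :LangLabel {:fields {:lang  {:type 'String}
                         :value {:type 'String}}}
    :Dataset   {:fields {:title    {:type :LangLabel}
                         ;; Alternative 2: one generated field per language.
                         :title_en {:type 'String}
                         :title_fr {:type 'String}}}}})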

@zeginis
Contributor

zeginis commented Sep 18, 2017

I think a single requested or default language is enough. So something like datasets(language: "fr") { ... } is ok.

It is preferable to get the available languages for a specific dataset not for the whole system because different datasets may have different available languages. e.g.

{
   languages(dataset: "http://statistics.gov.scot/data/earnings") { 
       country_code
   }
}

@RickMoynihan
Member Author

It is preferable to get the available languages for a specific dataset not for the whole system because different datasets may have different available languages.

👍

@zeginis
Copy link
Contributor

zeginis commented Dec 7, 2017

Some issues related to the language:

  • Greek labels were not supported
  • language tags (e.g. @en) cause errors

@RickMoynihan
Member Author

RickMoynihan commented Dec 7, 2017

Specifically the current problem with language strings is that they cause exceptions during schema generation by failing the following spec (from issue #53):

In: [0 :objects :dataset_vehicles_cube 1 :description] val: #grafter.rdf.protocols.LangString{:string "Vehicles Cube", :lang :en} fails spec: :com.walmartlabs.lacinia.schema/description at: [:args :schema :objects 1 :description] predicate: string?

@zeginis I think it would be desirable to keep the graphql schema simple here and avoid having to represent multiple languages in the schema at this stage, i.e. we should avoid doing things like this for every label/title:

{ 
  title {
     title  # the real title string
     language
  }    
}

i.e. I think I'd rather keep the schema for labels flat like this:

{
   title
}

This will probably mean, in the case of multiple languages, setting a default to use everywhere throughout the API; we could potentially allow overriding that default at the top of the query.

@zeginis Does that sound like an acceptable compromise? The limitation is that within a single request you won't be able to see things like the title of a dataset in both English and Greek.

@zeginis
Contributor

zeginis commented Dec 7, 2017

It is ok to define the language at the top of the query and thus get results in only one language.

@RickMoynihan
Member Author

One other question @zeginis: would it be acceptable not to let you set this at the top of the query, but instead to supply it as a configuration option to the server itself, i.e. no schema representation at all?

@zeginis
Contributor

zeginis commented Jun 22, 2018

One other question @zeginis: would it be acceptable not to let you set this at the top of the query, but instead to supply it as a configuration option to the server itself, i.e. no schema representation at all?

@RickMoynihan this solution is not applicable at OGI, since we will have cubes from many pilots on the same server with labels in different languages, e.g. Greek and English.

So it is preferable to define the language at the top of the query. Any idea how to do this?

@RickMoynihan
Member Author

Any idea how to do this?

It's not currently supported; if you're asking about how I think it should be implemented, then I'd suggest:

  1. We should introduce a new root cubiql node to support various parameters such as this for all subtree schemas. The idea being that parameters set at the root affect those parts of the query within its lexical scope:

i.e. we would probably have to change it to do this, so lang_preference affects not just datasets but specific dataset schemas, and any others we add too:

{
   cubiql(lang_preference: "gr") {
      datasets {
          title
          description
      }
   }
}
  2. The lang_preference attribute should specify a language tag preference, not a hard constraint, i.e. if you have a dataset with a dcterms:title of "School"^^xsd:string and "σχολείο"@gr it should select the Greek title. If however the description only has :school-ds dcterms:description "Numbers of schools by area"^^xsd:string, then it should fall back and return the xsd:string.

In terms of implementation I don't think there is a good way to express this priority on labels in SPARQL in a performant and simple enough way. So I think the best way to implement this is to make sure we implement all these queries as CONSTRUCTs, and then apply the priority filtering to all returned data. Roughly, the algorithm would be to group the local graph of ?s ?p ?o by ?s ?p; then, for each ?s ?p where ?o has datatype xsd:string or rdf:langString, return only the ?o matching lang_preference, failing that an xsd:string, and failing that any other rdf:langString.
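
Roughly sketched in Clojure (with made-up helper names and a simplified triple representation, rather than the actual cubiql code), that prioritisation could look something like:

;; Rough sketch of the prioritisation described above -- not the cubiql
;; implementation. Assumes the CONSTRUCTed triples arrive as maps like
;; {:s <subject> :p <predicate> :o <object>}, where a plain Clojure string
;; stands in for an xsd:string literal and a map such as
;; {:string "σχολείο" :lang :gr} stands in for an rdf:langString.

(defn lang-string? [o]
  (and (map? o) (contains? o :lang)))

(defn pick-label
  "Given all ?o values for one ?s ?p pair, prefer the requested language,
  then a plain xsd:string, then any other language-tagged string."
  [lang-preference os]
  (or (first (filter #(and (lang-string? %) (= (:lang %) lang-preference)) os))
      (first (filter string? os))
      (first (filter lang-string? os))))

(defn prioritise-labels
  "Group triples by [?s ?p] and keep a single string value per pair."
  [lang-preference triples]
  (into {}
        (for [[sp bindings] (group-by (juxt :s :p) triples)]
          [sp (pick-label lang-preference (map :o bindings))])))

;; Example, using the dataset from the comment above:
;; (prioritise-labels :gr
;;   [{:s :school-ds :p :dcterms/title       :o "School"}
;;    {:s :school-ds :p :dcterms/title       :o {:string "σχολείο" :lang :gr}}
;;    {:s :school-ds :p :dcterms/description :o "Numbers of schools by area"}])
;; => {[:school-ds :dcterms/title]       {:string "σχολείο", :lang :gr}
;;     [:school-ds :dcterms/description] "Numbers of schools by area"}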

Is something like this what you were thinking of implementing?

@zeginis
Contributor

zeginis commented Jun 22, 2018

Yes, this is what I was thinking of implementing. I realize it is not as simple as I expected.

Do you think there is a way to temporarily overcome the exceptions (#88) caused by the language tags even if we do not fully support filtering by language?

@RickMoynihan
Member Author

RickMoynihan commented Jun 22, 2018

That's a good question @zeginis. I suspect it's a pretty trivial fix to make that specific error go away, as it's probably not much more than calling str on the language-tagged string before returning it.
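
As a minimal sketch of that quick fix (assuming a LangString-shaped record like the one in the spec error above, and pulling out its :string field directly rather than relying on what str prints for the record):

;; Minimal sketch of the quick fix: strip the language tag before the
;; value reaches the schema, leaving plain strings untouched. Assumes a
;; LangString-like record carrying :string and :lang fields, as in the
;; error quoted above; not the actual cubiql code.
(defn plain-string [v]
  (if (and (record? v) (contains? v :string) (contains? v :lang))
    (:string v)
    v))

;; Given the LangString shown above, (plain-string v) => "Vehicles Cube";
;; a plain string like "Vehicles Cube" is returned unchanged.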

However there's still the expectation that there's only ONE value for a lot of these fields. So this would likely only really work for string properties with a cardinality of 1, since to retain the schema you'll need to pick just one string; and then you're into the territory of the above suggestion.

I could be wrong, but I'm not sure this hacky solution is worth doing, because you either need to implement the prioritisation logic above, or return a random string (unacceptable, as datasets would render with mixed languages), or hack your data so you only ever have one string for these fields (either an rdf:langString or an xsd:string would work -- but not both, and not more than one of each, i.e. no multi-lingual datasets). My feeling is that if you have to hack your data to remove the strings you don't want, you might as well have just hacked your data to make them xsd:strings.

The only counter-argument I can see to this (in support of implementing the str hack) is that it does mean cubiql will support a more correct subset of a larger cube, i.e. it's marginally better to allow "σχολείο"@gr in preference to "σχολείο"^^xsd:string, as you're not downgrading information; you're just loading a subset into cubiql's endpoint.

Practically speaking though, I'm not sure this correctness argument holds much weight, as you'll still need to hack your data to guarantee it works... it's just that the hack is a tiny bit less hacky.

@zeginis
Contributor

zeginis commented Jul 12, 2018

@RickMoynihan any update on this?

Are you going to fix this, or should we go on with the "quick fix" option -> calling str on the language-tagged string before returning it?

lkitching added a commit that referenced this issue Aug 1, 2018
Issue #6 - Move all dataset queries under a new top-level cubiql query
used to set global parameters on the contained queries. Datasets are
now fields on the qb type returned from the cubiql query.