Language-tagged strings #22
I'm concerned that a lot of the proposed solutions would bring with them just as many drawbacks. If a big part of the issue here is the challenge of querying language data, perhaps we should be looking first at the ergonomics of using SPARQL on this data and how that could be improved.
I think that looking at language tags in isolation may not be the right approach. To be really international in nature, there are a number of things one may want to "say" about the text, and the natural language is only one of those. For example:
etc. The experience I have with other specifications that rely on RDF literals (through serializations like JSON-LD) is that these issues come and bite you all the time. Yes, this all may converge towards the separate issue on literal as subject (#21), and may force us to fundamentally re-think how RDF treats literals.
A given word may be applicable to multiple languages, especially for loan words where one language borrows from another. Together with @iherman's points about other related kinds of properties, this suggests that we need a means to model the combination of a string literal with a given set of properties. I suspect this also fits in with the desire to be able to model property graphs where nodes and links can be associated with sets of property-value pairs, where the values can themselves be sets of property-values and so forth recursively. Once you have that, it is straightforward to model a value as being a word in a given language with a given pronunciation, writing direction and so forth. We would still need a small set of core data types, e.g. string, number, boolean, ID, link, but others could be layered on top with properties as annotations. A node that is used for a natural language word or phrase could have one property for the string value, another for the language, and another for the pronunciation. I will expand on this further in another issue.
Hi @iherman,
Hi @HughGlaser
I do not want to present myself as an i18n expert; I am very, very far from it. I just think that RDF literals may have serious i18n issues, and if we want to review literals in general, we will have to seriously look at that, too...
Yeah @iherman no problem with any of that.
My personal hope is that we:
They currently have a special status in RDF. "RDF 1.1 Concepts and Abstract Syntax currently contains many caveats to accommodate the idiosyncratic nature of language-tagged strings"
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0090.html
"It is a real pain to create these 3 component literals and to query for different languages and datatypes in SPARQL.
And worse still, if you want to query for strings that may or may not have language tags on, you need to do some real messing about."
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0098.html
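To make the "3 component" complaint concrete, here is a minimal Python sketch, not tied to any RDF library. The `Literal` tuple and the `find_by_text` helper are hypothetical names introduced only for illustration. It shows why a query for a string "that may or may not have language tags on" cannot rely on plain term equality and has to compare the lexical form alone, which is roughly what filtering on `str(?o)` does in SPARQL.

```python
from typing import List, NamedTuple, Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"


class Literal(NamedTuple):
    """Hypothetical three-component literal: lexical form, datatype, language tag."""
    lexical: str
    datatype: str
    lang: Optional[str] = None


def find_by_text(literals: List[Literal], text: str) -> List[Literal]:
    """Return every literal whose lexical form is `text`, tagged or not."""
    return [lit for lit in literals if lit.lexical == text]


data = [
    Literal("chat", XSD_STRING),
    Literal("chat", RDF_LANGSTRING, "en"),
    Literal("chat", RDF_LANGSTRING, "fr"),
    Literal("cat", RDF_LANGSTRING, "en"),
]

# Term equality treats the three "chat" literals as three different terms,
# so an exact match on the plain string finds only one of them.
exact = [lit for lit in data if lit == Literal("chat", XSD_STRING)]

# Matching "with or without a tag" has to ignore datatype and language.
loose = find_by_text(data, "chat")
```

Here `exact` holds one literal and `loose` holds three, which is the gap the quoted message is complaining about.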
"Using a general way to make statements about literals sounds good to me.
For geographical data I also see too many statements being squashed into a
single literal. It is difficult to process and to store. . . . Why have a standard provision for
indicating the language of a text string and not its pronunciation for
example?"
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0102.html
"language codes do matter, but are pretty inconvenient for multiple reasons:
and counter-intuitive to RDF novices),
language tags, and (b) complex rules for composition, e.g., with script and
region codes), and
same language can let people used to work with 3-letter codes chose
2-letter codes, which is an easy error to make, but can result in failure
to compare, e.g., "cat"@eng and "cat"@en. Not sure what should happen when
you compare "рука"@sr-Cyrl with "рука"@sr. Both are identical, the first is
just more explicit in stating that this is Cyrillic.)
well-defined enough, and its extension is slow, bureaucratic and doubtful)."
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0116.html
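The comparison problem in that message can be sketched in a few lines of Python. `lang_matches` below follows the basic filtering scheme of RFC 4647, which is what SPARQL's `langMatches` is defined in terms of; the `ISO639_3_TO_1` table and the `normalize` helper are hypothetical and deliberately tiny, included only to show that bridging "eng" and "en" needs an extra lookup that the matching rule itself does not provide.

```python
def lang_matches(tag: str, lang_range: str) -> bool:
    """Basic filtering in the style of RFC 4647: a case-insensitive
    prefix match that only succeeds on subtag boundaries."""
    tag, lang_range = tag.lower(), lang_range.lower()
    if lang_range == "*":
        return tag != ""
    return tag == lang_range or tag.startswith(lang_range + "-")


# Hypothetical alias table; a real one would cover all of ISO 639.
ISO639_3_TO_1 = {"eng": "en", "fra": "fr", "srp": "sr"}


def normalize(tag: str) -> str:
    """Lower-case the tag and map a 3-letter primary subtag to its 2-letter form."""
    primary, sep, rest = tag.lower().partition("-")
    return ISO639_3_TO_1.get(primary, primary) + sep + rest


# "cat"@eng does not match the range "en": the tags differ as strings.
eng_vs_en = lang_matches("eng", "en")

# "рука"@sr-Cyrl matches the range "sr", but not the other way round.
cyrl_in_sr = lang_matches("sr-Cyrl", "sr")
sr_in_cyrl = lang_matches("sr", "sr-Cyrl")
```

The asymmetry in the last two lines is the point of the "рука" example: range matching is not the same thing as deciding that two tagged strings are the same term.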
"RDF seems to violate its own doctrine by having separate
systems for data types and languages of literals."
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0143.html
IDEA: Eliminate the special status of language-tagged strings
"would it be possible to do away with the special status of language-tagged strings? . . . Would it be possible to define a regular lexical space, e.g., containing "hello@en"^^rdf:langString, together with a value-2-lexical and a lexical-2-value mapping? The N3 and SPARQL notation "hello"@en will of course still be available, and will be syntactic sugar for "hello@en"^^rdf:langString."
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0090.html
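A rough Python sketch of what such a pair of mappings might look like, assuming the lexical form is "text, then `@`, then tag" and that the split happens at the last `@`. The function names are hypothetical, and the tag is only lower-cased, not validated against BCP 47.

```python
from typing import Tuple


def lexical_to_value(lexical: str) -> Tuple[str, str]:
    """Map a lexical form such as 'hello@en' to a (text, language tag) pair.

    Splitting at the LAST '@' keeps any '@' inside the text itself intact.
    """
    text, sep, lang = lexical.rpartition("@")
    if not sep or not lang:
        raise ValueError(f"no language tag in {lexical!r}")
    return text, lang.lower()


def value_to_lexical(value: Tuple[str, str]) -> str:
    """Map a (text, language tag) pair back to its lexical form."""
    text, lang = value
    return f"{text}@{lang}"


greeting = lexical_to_value("hello@en")
address = lexical_to_value("user@example.org@en")
round_trip = value_to_lexical(greeting)
```

The second call shows one of the wrinkles with this design: the text may itself contain `@`, so the mapping has to commit to a splitting rule, and a string with no tag at all falls outside the lexical space.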
"Surely languages and datatypes should simply be RDF properties of Literals, which are 1 component things?
Much easier to explain to developers, and for them to use."
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0098.html
"That also fits in nicely with making it easier to represent property graphs."
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0101.html
"it would be much more efficient to declare the language used only once, at the class and/or metadata level. Using plain properties to indicate language enables doing that."
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0145.html
CONCERN: "The RDF 1.1 WG did spend some time [on language tags] - both on putting the langtag
into the lexical space and putting the lang tag into the datatype. Both
are not so easy; in the end the rdf@langString at least meant all
literals had a datatype."
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0097.html
CONCERN: "chat"@en and "chat"@fr are different.
"chat" rdf:lang "en" .
"chat" rdf:lang "fr" .
makes every use of "chat" both @en and @fr.
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0148.html
"I think the only way to avoid this would be if subject literals are be
taken as a notational short-hand for a blank node that carries the literal
as an rdf:value. (And, in a separate step, a problem-specific bnode
skolemization routine could be provided to give it a proper URI.)"
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0156.html
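Both the concern and the blank-node workaround can be shown with a toy triple set in Python. This is only a sketch: `rdf:lang` is the hypothetical property from the example above (it is not part of the RDF vocabulary), terms are plain strings, and `fresh_bnode`, `tagged_occurrence`, and `langs_of` are made-up helpers.

```python
import itertools

RDF_LANG = "rdf:lang"    # hypothetical property, as in the example above
RDF_VALUE = "rdf:value"

# Literal as subject: there is only one term "chat", so both statements
# attach to the same node.
literal_graph = {
    ('"chat"', RDF_LANG, '"en"'),
    ('"chat"', RDF_LANG, '"fr"'),
}


def langs_of(graph, subject):
    """Collect every language asserted for `subject` in `graph`."""
    return {o for s, p, o in graph if s == subject and p == RDF_LANG}


_counter = itertools.count()


def fresh_bnode() -> str:
    """Return a new blank node label."""
    return f"_:b{next(_counter)}"


def tagged_occurrence(text: str, lang: str):
    """Expand one use of a literal into a blank node carrying value and language."""
    node = fresh_bnode()
    return node, {(node, RDF_VALUE, text), (node, RDF_LANG, lang)}


en_node, en_triples = tagged_occurrence('"chat"', '"en"')
fr_node, fr_triples = tagged_occurrence('"chat"', '"fr"')
bnode_graph = en_triples | fr_triples
```

In `literal_graph`, `langs_of(literal_graph, '"chat"')` returns both tags, which is the objection. In `bnode_graph` each occurrence has its own node and therefore its own single language, at the cost of one extra node and two triples per string.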
"I really don't have a problem with every instance of "chat"^^xsd:string being both en and fr if someone has asserted that using rdf:lang. . . . Basically I think language tags are trying to avoid having to say in RDF what should be in the RDF."
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0164.html
IDEA: Use W3C OntoLex / Lemon as a basis for language tagging
"[It] is possible already [to declare language only once, at the class and/or metadata level] (using the pointers to ISO639 URIs in my earlier mail), and it is recommended practice to do so in OntoLex/lemon . . . . OntoLex is . . . a W3C community group report, but it would
be the most suitable basis for future standardization efforts in this direction."
https://www.w3.org/2016/05/ontolex/#lexicon-and-lexicon-metadata
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0145.html
IDEA: Use URIs to identify language
"A much more convenient solution would be to identify the language by means
of a URI. This can be an ISO 639 category (see under
http://id.loc.gov/vocabulary/iso639-2.html and
http://id.loc.gov/vocabulary/iso639-1.html; for ISO 639, cf.
http://www.lexvo.org/), or provided by another authority (e.g.,
https://glottolog.org/). Other properties (e.g., xsd datatypes) could also
be stated about a literal. Two strings could be considered identical if the
values are the same and the properties of one are a proper subset of the
properties of the other."
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0116.html
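The proposed comparison rule is easy to state in Python. The sketch below is an interpretation, not a definitive reading: `AnnotatedString` is a hypothetical type, the `ex:` names are placeholders rather than real URIs, and the check uses a non-strict subset so that two strings with exactly the same properties also count as identical, which goes slightly beyond the "proper subset" wording of the quote.

```python
from typing import FrozenSet, NamedTuple, Tuple


class AnnotatedString(NamedTuple):
    """Hypothetical string value plus a set of (property, value) annotations."""
    value: str
    properties: FrozenSet[Tuple[str, str]]


def considered_identical(a: AnnotatedString, b: AnnotatedString) -> bool:
    """Same value, and one property set is contained in the other.

    Assumption: a non-strict subset is used, so equal property sets match too.
    """
    if a.value != b.value:
        return False
    return a.properties <= b.properties or b.properties <= a.properties


# Placeholder identifiers; real data would use language URIs such as those
# published by the authorities named in the quote.
full = AnnotatedString("рука", frozenset({("ex:language", "ex:srp"), ("ex:script", "ex:Cyrl")}))
partial = AnnotatedString("рука", frozenset({("ex:language", "ex:srp")}))
bare = AnnotatedString("рука", frozenset())
russian = AnnotatedString("рука", frozenset({("ex:language", "ex:rus")}))
```

One consequence worth noting: under this rule `bare` is identical to both `partial` and `russian`, while `partial` and `russian` are not identical to each other, so the relation is not transitive.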
"a downward-compatible notation is possible:
following
BCP47 code [indicating region and script] and a URI [unambiguously
identifying the language])
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0119.html
CONCERN: "No. All literals MUST have a type, so that queries can have a
unique response when they ask for the type or specify the type.
The RDF 1.1 WG spent a lot of time and effort on this. Allowing
untyped plain literals in RDF 2004 was a bug. Please do not screw
this up again. Plain literals are syntactically legal (to
preserve backward compatibility) but they now have type xsd:string."
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0149.html
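The "every literal has a type" rule described here amounts to a small default, sketched below in Python. `datatype_of` is a hypothetical helper; the two datatype IRIs are the ones RDF 1.1 uses for plain and language-tagged strings.

```python
from typing import Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"


def datatype_of(datatype: Optional[str] = None, lang: Optional[str] = None) -> str:
    """Return the datatype a literal has under RDF 1.1.

    A language-tagged string is always rdf:langString; a literal written
    with neither tag nor datatype defaults to xsd:string.
    """
    if lang is not None:
        return RDF_LANGSTRING
    return datatype if datatype is not None else XSD_STRING


plain = datatype_of()               # "рука"
tagged = datatype_of(lang="sr")     # "рука"@sr
```

With this default in place, a query that asks for a literal's type always gets exactly one answer, which is the property the concern wants to preserve.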
"But this only means that "рука" entails [a xsd:string] . . . .
As far as comparisons between strings are concerned, this makes no
difference to the example, as the subset relation between the (implicit)
properties of "рука"@sr and "рука" still holds"
https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0152.html