Support probabilistic knowledge representations #78
Comments
The current mindset for the Semantic Web is oblivious to these points, instead narrowly focusing on deductive logic and model theory. There is however a great deal to be gained by studying over 500 million years of neural evolution and decades of work across the cognitive sciences. Cognitive AI seeks to mimic human memory, reasoning, learning and natural language processing at a functional level. This involves a combination of symbolic graphs, sub-symbolic statistics, rules and graph algorithms, along with a willingness to adopt an interdisciplinary approach to research, something that is unfortunately generally discouraged when it comes to incentives for academic careers. The W3C Cognitive AI Community Group is formalising the Chunks graph data and rules language, inspired by earlier work by John Anderson on ACT-R, a popular cognitive architecture. Chunks is easier to work with than Turtle, JSON and JSON-LD. It includes the means to map to RDF URIs where needed. As such, Chunks is a viable candidate for Easier RDF, and one that opens up new vistas of opportunities to give computing a human touch.
See also #71
Indeed, hence my proposal to somehow combine the two areas so that humans and machines can synergize better, with most humans in control; the undesirable but likely alternatives are machines and/or "elites" being in control. Also, humans are basically statistically blind, which makes it very hard to make good decisions in larger groups. I would be grateful for more comments on the specific topics and resources I mentioned.
Just a historical point: several years ago the W3C did set up a “W3C Uncertainty Reasoning for the World Wide Web Incubator Group” which did publish a report. The report as well as the charter referred to above contain a large number of references and use cases, but there was no real follow-up on the report in terms of a W3C WG, i.e., for standardization. As far as I can remember there wasn't a clear, standardization-ready approach to go with, and the interest from W3C members was mild, to say the least. I have not followed the evolution since 2008 (I have drifted away from the subject area), and I do not know whether the area is more mature than it was back then (afaik, the topic was picked up at some subsequent ISWC conferences as workshops).
Thanks. Interesting, will review.
I am not really expecting that the
Good to know that things have evolved. The main reason I gave the reference is my experience (as a W3C Staff member) of the mild interest of the semantic web community back then in engaging in more systematic standardization work, and I frankly do not know whether it would be easier to do it now. I sincerely hope... Cc @pchampin
Within the OntoLex W3C CG, we're currently developing a novel module on Frequency, Attestation and Corpus-based Information (OntoLex-FrAC). Among other things, corpus-based information includes embeddings (in the "word embedding" sense as well as other numerical representations), with the specific goal to be applicable to any OntoLex concept (which includes ontologies in general). Representing word embeddings in RDF doesn't give much of a benefit, but for sense and concept embeddings that get easily detached from their definition, that's quite different. The current status will be presented at SemDeep on Jan 8th, 2021.
Glad to hear about
Maybe in this regard is worth mentioning Recently I also found "Embedding OWL Ontologies with Related:
That's exactly what we had in mind. The difference is that these approaches are typically oriented at algorithms, i.e., at creating embeddings from knowledge graphs or at extending knowledge graphs by embedding-based techniques. The OntoLex extension is purely representational. It is about storing (and re-using) such information together with a knowledge graph. With a standard vocabulary for this purpose, it will become possible to provide APIs to store and load such data bundles more efficiently and in a way that makes sure the user eventually has access to both the embeddings and the underlying graph.
For certain kinds of knowledge graphs, having both information sources may be less essential, because the embeddings themselves are a stochastic approximation and generalization over the knowledge graph and can be readily applied in different settings. For lexical information, however, this is very different, because there is a ground truth (in the dictionaries/wordnets) from which we want to deviate only if the explicit information we have is insufficient (e.g., for out-of-vocabulary words). And provenance is key here, because dictionaries differ widely in scale, quality, methodology, and purpose, and when aggregating over different dictionaries (which is a good idea in general, to improve coverage), we should be able to keep track of that information.
As an example, the MUSE dictionaries by Facebook (https://github.com/facebookresearch/MUSE) have certain valid uses, but they contain a lot of noise, as they're automatically created from translation memories (I guess). For creating multilingual embeddings, they're probably sufficient, but not for applications in MT or localization. The Apertium dictionaries (https://apertium.org/), on the other hand, are much more carefully curated and specifically designed for MT. They are general-purpose, though, and lack support for specific domains. For some languages, they are small, and they also emphasize lexical concepts, whereas function words may be left out for certain languages. The existing multilingual WordNets, again, are great resources that can be used to complement dictionaries with semantic concepts, but they lack coverage of certain grammatical categories, and they have some imbalances in the taxonomy they posit.
Now, we can just combine all that data in a single lexical knowledge graph, hoping that this compensates for the respective biases, and then induce or create embeddings over it. But if we run into any weird behaviour in downstream applications, we should at least be able to track whether it is related to the specific source of information involved. (So we can disable, replace or fix it.)
Of course, embeddings can be calculated on the fly as well. But when it comes to multilingual applications, lexical resources can get quite substantial, e.g., https://github.com/acoli-repo/acoli-dicts. (Sorry for not providing a triple count; the basic unit we operate with is the existence of > 10,000 translations per language pair.) So, inducing embeddings over this graph for every individual application is both a massive waste of energy and a compatibility hazard if different applications are supposed to operate on the same embeddings (inducing embeddings involves non-deterministic aspects).
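To make the provenance point a bit more concrete, here is a minimal Python sketch of a lexical graph that remembers which dictionary each translation came from, so that a noisy source can be switched off later. This is only an illustration under assumptions: it is not the OntoLex-FrAC vocabulary, and the class, method and source names (`LexicalGraph`, `disable`, "curated", "auto") are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Translation:
    source_word: str
    target_word: str
    provenance: str  # hypothetical label of the dictionary this pair came from


class LexicalGraph:
    """Toy aggregate of translation pairs that keeps per-pair provenance."""

    def __init__(self):
        self._entries = set()
        self._disabled = set()

    def add(self, source_word, target_word, provenance):
        self._entries.add(Translation(source_word, target_word, provenance))

    def disable(self, provenance):
        # Ignore a whole source, e.g. after it caused odd downstream behaviour.
        self._disabled.add(provenance)

    def translations(self, source_word):
        """Map each target word to the sorted list of active sources attesting it."""
        result = {}
        for entry in self._entries:
            if entry.source_word != source_word:
                continue
            if entry.provenance in self._disabled:
                continue
            result.setdefault(entry.target_word, []).append(entry.provenance)
        return {target: sorted(sources) for target, sources in result.items()}


graph = LexicalGraph()
graph.add("dog", "Hund", "curated")   # hypothetical curated dictionary
graph.add("dog", "Hund", "auto")      # hypothetical automatically built dictionary
graph.add("dog", "Hunde", "auto")
before = graph.translations("dog")
graph.disable("auto")
after = graph.translations("dog")
```

Because every pair carries its source, embeddings induced over the graph can be regenerated without the disabled source instead of discarding the whole aggregate.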
Indeed!
Knowledge representations may have a probabilistic nature and capturing that is especially important for complex business domains where data is ... non-stationary or contextual.
This is often seen in machine learning models like #WordEmbedding, #ContextualWordEmbedding, #KnowledgeGraphEmbedding, etc.
Often semantic models are distilled by experts who use implicit expertise and extensive curation processes; this makes the final models lossy, not easily traceable to the original source data, and ultimately less useful to non-experts than they could be.
It would be very useful to have the ability to represent probabilistic knowledge in the semantic web world so that the models are more robust, defensible, trusted and accessible.
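As a rough sketch of what this could look like, the Python snippet below attaches a probability and a source to each triple and merges repeated assertions of the same statement. The merge rule (noisy-OR, which assumes the sources are independent) is my own assumption for illustration and is not taken from any of the resources listed in this issue; the `ex:` identifiers and the names `WeightedTriple` and `combine` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WeightedTriple:
    subject: str
    predicate: str
    obj: str
    probability: float  # belief that the statement holds, in [0, 1]
    source: str         # where the assertion came from (kept for traceability)


def combine(triples):
    """Merge assertions of the same statement with noisy-OR.

    Assumes independent sources: p = 1 - product(1 - p_i).
    """
    remaining_doubt = defaultdict(lambda: 1.0)
    for t in triples:
        if not 0.0 <= t.probability <= 1.0:
            raise ValueError(f"probability out of range: {t.probability}")
        remaining_doubt[(t.subject, t.predicate, t.obj)] *= 1.0 - t.probability
    return {statement: 1.0 - doubt for statement, doubt in remaining_doubt.items()}


assertions = [
    WeightedTriple("ex:apple", "rdf:type", "ex:Fruit", 0.5, "model-a"),
    WeightedTriple("ex:apple", "rdf:type", "ex:Fruit", 0.5, "model-b"),
    WeightedTriple("ex:apple", "rdf:type", "ex:Company", 0.25, "model-a"),
]
beliefs = combine(assertions)
```

Keeping the `source` field on every assertion is what makes the combined belief traceable back to the original data, which is the point of the request above.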
The intuition behind probabilistic RDF seems somewhat related to language models like #ContextualWordEmbedding, which are probabilistic (they have #DistributionalSemantics, as opposed to the lexical models) but also capture context, thus making the concepts more grounded and easily traceable to the original resources they were extracted from. 🤔
Resources
Scalable Uncertainty Treatment Using Triplestores and the OWL 2 RL Profile
PR-OWL, Multi-Entity #Bayesian Networks (#MEBN): "hybrid ontologies .. deterministic and probabilistic parts"
"Probabilistic RDF": "build a logical model of RDF with uncertainty"
"Combining RDF Graph Data and Embedding Models for an Augmented Knowledge Graph": "integrated #RDF data with vector space models", #knowledgeGraph, #wordEmbedding, #graphEmbedding
"FoodEx2vec: New foods' representation for advanced food data analysis". See also the FoodOn ontology.