Layout namespaces and discovery proposal #15

Closed
wants to merge 1 commit into from

@parente
Member

parente commented Apr 26, 2016

This PR proposes a very lightweight spec for notebook layout metadata, addressing namespaces and discovery. It does not attempt to define a common schema for layout metadata covering all tools and use cases, for the reasons given in the PR (see Intentionally Limited Scope) and based on the discussion in the roadmap PR about dashboards.

After dwelling on this for a while and writing it up, I don't think there's much value here versus simply asking layout tools to document their metadata format and rendering procedure, like we've now done for jupyter-incubator/dashboards. If there's anything of value here, it's in the basic guidelines given for picking a unique namespace under metadata to avoid conflict. (But such guidelines are generally applicable across all tools that wish to write to metadata, not just layout tools.)

"IMHO" aside, we agreed at the dev meeting that an enhancement proposal on this topic was on the path of advancing dashboard support in Jupyter, so here it is for discussion.

cc'ing folks who were in the room when we discussed dashboards during the March dev meeting, and others who expressed an interest: @bollwyvl @fperez @ellisonbg @minrk @sccolbert @blink1073 @jasongrout @rgbkrk @lbustelo @jhpedemonte @dalogsdon

@rgbkrk
Member

rgbkrk commented Apr 26, 2016

I'm all for getting this well specced. 😄

@bollwyvl

Yeah, we kinda ran up against a wall of what we'd really be able to do with this (sorry, @parente for not getting back to you more).

We came up with these issues that make it hard to get to one data standard:

  • cell identity
  • number of layouts per notebook file
  • units
  • viewport vs document
  • infinite axes vs constrained axes
  • media queries

Of those, the cell identity one is really the most fundamental to this kind of work: having the layouts spread all over the place in different cells will inevitably make layouts more brittle. But having a "blessed" field (e.g. @id) could also be contentious.

As it is, we'd still be left with, say, nbshow having to know about the existence, and the entire metadata spec (and where it might get stored), of nbdash, which means N:N integrations... at which point, each extension is better off just having its own key, as there may be things unrelated to layouts stored in these locations.

It could be that there exists a suitable structured data vocabulary that we could bring in more-or-less whole cloth (via JSON-LD), but for the time being (and until notebook 5.0 hits), I don't see how we'd end up with anything other than work to show for it...

@parente
Member Author

parente commented Apr 28, 2016

@rgbkrk

I'm all for getting this well specced. 😄

Define "this". :)

The metadata and rendering spec for the dashboards extension is here for the time being: https://github.com/jupyter-incubator/dashboards/wiki/Dashboard-Metadata-and-Rendering. We'll happily take issues about it or direct edits. There's probably an equivalent page for nbpresent (or certainly could be). And for RISE. And for the slideshow toolbar in Notebook.

I don't think there's one Jupyter spec that can rule them all. Or, at least, I personally have no clue how to write it at the moment.

@parente
Member Author

parente commented May 6, 2016

Friday thought ...

Should this turn into a simple spec about where extensions should put their metadata in the notebook document?

@jhpedemonte, @dalogsdon, @nitind and I have been thinking about the metadata we use in the dashboards projects and how to go from what we evolved as we went open source and added features, to something cleaner and more easily extended in the future. (v1 draft over here: https://github.com/jupyter-incubator/dashboards/wiki/Dashboard-Metadata-and-Rendering) We still don't have an idea of how to write one spec for all tools to follow, but we did all agree that it's kind of silly to have a top level "layouts" key that the PR proposed. Really, all extensions that want to shove data into an ipynb are going to want some guidance on how to avoid stepping on each other, especially if creating extensions keeps getting easier (4.2, 5.0, ...)

The PR could simply become:

  1. metadata.extensions = {} is reserved for arbitrary extension data at both the notebook and cell levels, to avoid stepping on spec'ed notebook metadata (e.g., kernel info) or getting stepped on by future notebook schema changes.
  2. Extensions should create a key under metadata.extensions named after their GitHub repo, PyPI package, npm package, or some other reserved name.
  3. Extensions that want to be interoperable should version and document what they store under metadata.extensions.<name> for other developers.
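
As a rough illustration of the three points above, here is a minimal sketch that writes versioned data under metadata.extensions.<name> at both the notebook and cell levels. It works on plain dicts shaped like ipynb JSON rather than on any Jupyter API. The extension name, the "version" field, and the layout field names are all assumptions made up for this example, not the dashboards extension's actual format.

```python
import json

# Hypothetical extension name; the list above suggests naming the key after
# a GitHub repo, PyPI package, npm package, or some other reserved name.
EXTENSION_NAME = "jupyter_dashboards"


def set_extension_metadata(node, name, data, version):
    """Store versioned data under node["metadata"]["extensions"][name].

    `node` can be a whole notebook or a single cell, both represented here
    as plain dicts shaped like the ipynb JSON.
    """
    metadata = node.setdefault("metadata", {})
    extensions = metadata.setdefault("extensions", {})
    extensions[name] = {"version": version, **data}
    return node


# A minimal notebook-shaped dict (not validated against the nbformat schema).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {"kernelspec": {"name": "python3", "display_name": "Python 3"}},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": "# Title"},
    ],
}

# Notebook-level data (illustrative field names).
set_extension_metadata(notebook, EXTENSION_NAME, {"active_view": "grid"}, version=1)

# Cell-level data (illustrative field names).
set_extension_metadata(
    notebook["cells"][0], EXTENSION_NAME, {"row": 0, "col": 0, "width": 12}, version=1
)

print(json.dumps(notebook["metadata"], indent=2))
```

The spec'ed kernelspec key is left untouched, and everything the extension owns sits under one key that it can version and document.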

@blink1073
Contributor

Big 👍 from me.

@bollwyvl

bollwyvl commented May 7, 2016

At risk of ♣️ing a 💀 🐴...

A big win would be to adopt a self-describing, application-layer meaning on top of the cell and notebook metadata, if not the whole document.

JSON Schema provides some of this, but as it isn't a managed standard, you kinda get what you get with your implementation. Also, the implementation of $ref, which is kind of a big deal, is a little vague.

The best standard out there I have found to this end is JSON-LD.

The big three of JSON-LD are:

  • @context: an in-or-out-of-band way to describe the meaning of the keys of the object in which it was defined, which uses:
  • @id: a strong identity
  • @type: a strong type, which can be a list

All of these take advantage of XML-style nestable namespace decimation, i.e. jptr --> https://jupyter.org/ns#.

The out-of-band part means that in many cases, existing JSON doesn't need to change.
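
To make the prefix idea concrete, here is a small sketch of how a context kept outside the document could give meaning to compact names. Only the jptr prefix and its URI come from the comment above; the document contents and the expand_term helper are hypothetical, and the helper handles only the simple "prefix:suffix" case, not the full JSON-LD expansion algorithm.

```python
# Out-of-band context: lives outside the document it describes.
CONTEXT = {"jptr": "https://jupyter.org/ns#"}


def expand_term(term, context):
    """Expand a compact 'prefix:suffix' name to a full URI if the prefix is known."""
    prefix, sep, suffix = term.partition(":")
    if sep and prefix in context:
        return context[prefix] + suffix
    # Unknown prefix (or no prefix at all): leave the term unchanged.
    return term


# Hypothetical existing metadata; it did not have to change to be described.
doc = {"@id": "jptr:layout-1", "@type": ["jptr:NotebookLayout"]}

expanded = {
    "@id": expand_term(doc["@id"], CONTEXT),
    "@type": [expand_term(t, CONTEXT) for t in doc["@type"]],
}
print(expanded)
```

A real implementation would use a JSON-LD library instead of this hand-rolled helper.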

A key exception to this is shapes like package.json's dependencies, which use an uncountable number of possible keys, which actually describe:

a package --- has --> a dependency
                        /       \
                    on package  at version
                      /           \
                     V             V
                some package    some version query
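
One way to read the diagram: the keys of a dependencies mapping are themselves data (package names), so no fixed vocabulary can describe them. A sketch of restructuring such a mapping into records with a fixed set of keys follows. The package names and the "package" / "version" key names are assumptions chosen for illustration.

```python
# package.json-style mapping: every key is a package name, so the set of
# possible keys is unbounded.
dependencies = {"left-pad": "^1.0.0", "lodash": ">=4.0.0"}


def to_fixed_keys(deps):
    """Turn {name: version_query} into a list of records with fixed keys."""
    return [
        {"package": name, "version": query}
        for name, query in sorted(deps.items())
    ]


records = to_fixed_keys(dependencies)
print(records)
```

With this shape, "package" and "version" can each be described once in a context, no matter how many packages are listed.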

ANYHOOOO...

A document described in this way would then be able to have one of several things done to it:

  • expansion: a canonical expansion of all of the keys to full URIs
  • compaction: a re-structuring of the document
  • flattening: remove the internal structure of the tree, leaving a list of nodes with their links
  • framing: akin to GraphQL, sort of a query-by-example
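
As a rough picture of the flattening operation listed above, this sketch pulls every nested node that carries an "@id" out into a flat list and leaves a reference in its place. It is a simplification written for this discussion: real JSON-LD flattening also expands terms and assigns identifiers to anonymous nodes, which this does not attempt. The slideshow data is hypothetical.

```python
def flatten(node, nodes=None):
    """Return a flat list of nodes, replacing nested nodes with {"@id": ...} links."""
    if nodes is None:
        nodes = []
    flat = {}
    nodes.append(flat)
    for key, value in node.items():
        if isinstance(value, dict) and "@id" in value:
            # Nested node: move it to the top-level list, keep a reference.
            flatten(value, nodes)
            flat[key] = {"@id": value["@id"]}
        elif isinstance(value, list):
            items = []
            for item in value:
                if isinstance(item, dict) and "@id" in item:
                    flatten(item, nodes)
                    items.append({"@id": item["@id"]})
                else:
                    items.append(item)
            flat[key] = items
        else:
            flat[key] = value
    return nodes


# Hypothetical nested layout metadata.
tree = {
    "@id": "deck",
    "@type": "Slideshow",
    "slides": [
        {"@id": "slide-1", "@type": "Slide"},
        {"@id": "slide-2", "@type": "Slide"},
    ],
}

flat_nodes = flatten(tree)
for n in flat_nodes:
    print(n)
```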

This could circle back to the layout interop thing: one could see how, for example, an nbpresent slideshow and a dashboards dashboard could, by specialization or composition, have type NotebookLayout, while a region and a masonry block could be CellPartLayout, even if none of those types actually appeared anywhere in the definition.

Part of this whole robust-ifying would be to create some types that would actually have impact elsewhere: for example, were we to extend the schema.org CreativeWork, with, say, ComputableDocument, we could start getting to the heart of the matter of what all this stuff is... and make the content available in them discoverable at Web Scale (not a dig at MongoDB, for once).

@parente
Member Author

parente commented May 10, 2016

@bollwyvl I gave the notebook document you linked a read. I'd never seen it before.

I'll admit I don't completely grok the use cases it addresses. Can you give an example of what impact adding context, id, and type would have for layouts, or, say, nbpresent specifically?

@bollwyvl

I gave the notebook document you linked a read. I'd never seen it before.

@parente Cool, thanks! The long con on this is being able to leverage the world + dog's notebooks as something at the scale of the Wolfram Language.

Consider if pandas.DataFrame.describe generated not only HTML, but embedded, machine-readable metadata. Without any further typing, one could ask for notebooks that included a dataset with:

  • a float column with values around 33.756±0.001 and
  • a float column with values around 84.352±0.001
as a rough cut for "around Atlanta". Obviously, this would need some work in post.

With a second layer of information beyond float, like https://schema.org/latitude one could be pretty darned sure that what you were getting out was indeed of interest to you.
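
The kind of query described above could look something like the following sketch. It assumes a hypothetical index mapping notebook paths to per-column summaries (name, dtype, mean); neither the index format nor the helper functions exist in pandas or Jupyter today, and the sample values are invented.

```python
# Hypothetical index of machine-readable column summaries, one list per notebook.
notebooks = {
    "a.ipynb": [
        {"name": "lat", "dtype": "float64", "mean": 33.7561},
        {"name": "lon", "dtype": "float64", "mean": 84.3524},
    ],
    "b.ipynb": [
        {"name": "price", "dtype": "float64", "mean": 19.99},
    ],
}


def has_float_column_near(columns, target, tolerance):
    """True if any float column has a mean within `tolerance` of `target`."""
    return any(
        col["dtype"].startswith("float") and abs(col["mean"] - target) <= tolerance
        for col in columns
    )


def find_notebooks(index, targets, tolerance=0.001):
    """Return paths of notebooks that have a matching float column for every target."""
    return [
        path
        for path, columns in sorted(index.items())
        if all(has_float_column_near(columns, t, tolerance) for t in targets)
    ]


matches = find_notebooks(notebooks, [33.756, 84.352])
print(matches)
```

Matching on dtype and value alone is the "rough cut"; a semantic type on each column would let the query ask for latitude directly instead of guessing from numbers.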

Can you give an example of what impact adding context, id, and type would have for layouts, or, say, nbpresent specifically?

Inside nbpresent notebook metadata, a number of things eventually don't fit cleanly inside a hierarchy:

  • this slide appears after another slide
  • a cell source appears in these two slides

Instead of doing ugly, MongoDB-style dereferencing with n+1 queries, I could treat the underlying document as the canonical location for writing, but read from (or subscribe to) a flattened form of the graph.

If there was a jupyter:MultiCellLayout that multiple extensions could embrace and extend, then not only could the nbpresent UI use it directly, but it could also look through other top-level metadata information and find other layouts that might be interesting to import... without actual inference (i.e. if A subclasses B, then a, an instance of A, is also an instance of B) one has to duplicate some information...

Here's what some of that data might look like: http://tinyurl.com/jzed66n
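
The inference step mentioned above (an instance of a subclass also counts as an instance of its parent type) can be sketched in a few lines. Only jupyter:MultiCellLayout comes from the comment; the other type names, the subclass table, and the metadata layout are hypothetical.

```python
# Hypothetical subclass relation: child type -> parent type.
SUBCLASS_OF = {
    "nbpresent:Slideshow": "jupyter:MultiCellLayout",
    "dashboards:GridLayout": "jupyter:MultiCellLayout",
}


def is_instance_of(type_name, target, subclass_of):
    """Walk up the subclass chain from `type_name` looking for `target`."""
    while type_name is not None:
        if type_name == target:
            return True
        type_name = subclass_of.get(type_name)
    return False


# Hypothetical top-level notebook metadata written by three extensions.
metadata = {
    "nbpresent": {"@type": "nbpresent:Slideshow"},
    "dashboards": {"@type": "dashboards:GridLayout"},
    "cite2c": {"@type": "cite2c:Bibliography"},
}

# Find every entry that is some kind of multi-cell layout, without any
# extension having to know the others' key names in advance.
layouts = sorted(
    key
    for key, value in metadata.items()
    if is_instance_of(value.get("@type"), "jupyter:MultiCellLayout", SUBCLASS_OF)
)
print(layouts)
```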

@parente
Member Author

parente commented May 17, 2016

The long con on this is being able to leverage the world + dog's notebooks as something at the scale of the Wolfram Language.

OK. To echo back my understanding, it's about making notebook content more easily discoverable and reusable. The application of the approach to layout is just one example, and the fundamental idea would cover all notebook content.

If there was a jupyter:MultiCellLayout that multiple extensions could embrace and extend ...

I think this comes back to doing:

  1. something tactical in the short term to provide guidance on the relatively easy problem of:

"Where should extensions (layout or otherwise) that want to put data in the notebook document write their data so that it avoids conflict with other extensions AND with future notebook format changes?"

  2. starting a new proposal based on your thoughts above about developing a shared grammar, either specifically for layouts or notebook concepts in general, that addresses the much bigger problem space of:

"How should tools that write to notebook documents store content and metadata so that it can be identified and reused by other tools?"

This was the division of right now vs over time I was hinting at (poorly) in the roadmap. Shared concepts like jupyter:MultiCellLayout and agreement on IDs aren't going to spring up overnight, while extensions / plug-ins to Jupyter Lab already are.

I suggest turning this PR into guidance on the "where" problem (#15 (comment)) and starting a new PR about how JSON-LD could apply. I don't think getting this simple proposal discussed and accepted would hinder anything later: the ID, types, etc. will apply wherever extension or non-extension metadata is stored. By the same token, the simple recommendation of having extensions / plugins store metadata in metadata.extensions.<package name> heads off a potential mess as the notebook environment becomes more and more pluggable.

What do folks think? @bollwyvl in particular, since he's the most likely candidate for seeding that new PR with his JSON-LD expertise.

@minrk
Member

minrk commented May 18, 2016

I think identifying a recommendation for where extension metadata should go is a good idea. Currently, the official recommendation is an extension-specific location, but that is not specified in detail. In practice, the recommendation is to use metadata.your_extension = {}, rather than metadata.extensions.your_extension = {}. I don't have a strong preference either way, but there's some clarity in putting all extension metadata under metadata.extensions.

@parente
Member Author

parente commented May 18, 2016

rather than metadata.extensions.your_extension = {}

I think recommending metadata.extensions is the way to go so that there's no chance of a future conflict between a new key in future notebook formats (v5+) and a name picked by an extension author. That is, metadata.something leaves the door open to headaches later where the official notebook schema wants to use something and an extension author is already using something.
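
The conflict scenario described above can be sketched as a simple set check. The "layout" key added by a future schema is purely hypothetical, as are the helper names. The schema key set is a small illustrative subset rather than the full notebook format, and the sketch assumes "extensions" itself is reserved and never reused by the schema.

```python
def top_level_keys(extension_name, nested):
    """Top-level metadata keys an extension occupies under each convention."""
    # Nested convention: metadata.extensions.<name>; flat: metadata.<name>.
    return {"extensions"} if nested else {extension_name}


def conflicts(extension_name, schema_keys, nested):
    """True if the extension's top-level key collides with a schema-owned key."""
    return bool(top_level_keys(extension_name, nested) & set(schema_keys))


# Illustrative subset of keys owned by the notebook schema today.
current_schema = {"kernelspec", "language_info"}
# Hypothetical future schema that starts using a "layout" key.
future_schema = current_schema | {"layout"}

print(conflicts("layout", current_schema, nested=False))  # no clash today
print(conflicts("layout", future_schema, nested=False))   # clash after the change
print(conflicts("layout", future_schema, nested=True))    # nested key is unaffected
```

The check also shows the counterargument's point: a sufficiently unique flat name never appears in the schema set either, so the nesting only matters for generic names.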

@takluyver
Member

I'll take the opposite position: metadata.extensions.foo feels like unnecessary nesting, at least when 'foo' is something unique enough that it's an implausible key for something else to use - e.g. cite2c stores stuff in metadata.cite2c.

Additionally, metadata is already a namespace for additional information - the official notebook schema can add keys outside of metadata. So having an 'extensions' namespace inside that feels a bit redundant.

@parente
Member Author

parente commented May 24, 2016

@takluyver said:

I'll take the opposite position: metadata.extensions.foo feels like unnecessary nesting

If consensus forms around doing everything in metadata directly, then I don't think there's any value in doing a simple proposal. Extension authors are already putting keys of their choosing in notebook and cell metadata.

@parente said:

I suggest turning this PR into guidance on the "where" problem (#15 (comment)) and starting a new PR about how JSON-LD could apply.

Anyone else have an opinion on whether this simple proposal has any value (i.e., where extension metadata should go), if a JSON-LD proposal should be separate, or otherwise?

@takluyver
Member

Anyone else have an opinion on whether this simple proposal has any value (i.e., where extension metadata should go), if a JSON-LD proposal should be separate, or otherwise?

I continue to think that JSON-LD, and semantic web technology in general, is an overcomplicated solution in search of a problem (this is a debate we've had before). So I'm quite happy for extension metadata to stay in a simple format.

@bollwyvl

is an overcomplicated solution in search of a problem (this is a debate we've had before)

Indeed, it is out of respect for this perspective that I haven't pushed more on this issue since raising it some years ago. Just to summarize, here are some of the problems I think adopting strongly-typed, URI-based metadata solves:

Discoverability

Search engines, journals, content republishers, etc. make heavy use of strongly-typed data to provide better results to users about traditional metadata: who said it, what it is about, when it was said, what you can do with it, who referenced it. The only debate, really, is which standard to use, not whether this is a good idea: some folks don't like Dublin Core, for example. If notebooks do not play in this space, they will always be treated more like a figure than the artifact of record, and generating content for these outlets will always require the kinds of manual steps that limit the speed and impact of publishing.

Documentation

The Matlab and Mathematica documentation user experiences are objectively superior to the documentation our users experience because of their language homogeneity, and the integration of the authoring runtimes. Of course, we could do that, too, post-hoc: tools exist for statically reverse-engineering the symbolic content of code in most of our kernels. However, by doing the kind of extensions needed for better completion, mentioned elsewhere, we could make this part of the kernel implementation itself. If we raise the bar for the naming of atomic units of code, every notebook cell (in context) would become a potential source of documentation, ideally, annotating the types at cell execution time.

Data

While there are different kernels, libraries, etc. the schema of the data of interest are often shared: tables, trees, graphs, singletons. As mentioned earlier, if a DataFrame.describe, as part of the _repr_html_, displayed metadata about the types and properties found there (via embedded JSON-LD or RDFa or even updates to the metadata), the global set of notebooks would become a lightweight, discoverable registry to the shapes of data people are working on.

Interoperability

If extension developers had the choice of building their own data format or inheriting from an existing one, preferably managed outside of a specific implementation, we'd be able to start moving the ecosystem forward and building new and cool things. Describing "stuff in a viewport how the user wants it" is really quite a useful thing to achieve some consensus on, which is why I even brought this stuff up again! Widgets, I think, would benefit tremendously if they could carry units, etc.

Anyhow, just throwing this stuff out there!

@parente
Member Author

parente commented Jan 10, 2017

I'm going to close out this proposal as it's lingered here for some time and failed to garner significant support. The dashboards extension documents where it puts its metadata and that's just fine.

parente closed this Jan 10, 2017

@rgbkrk
Member

rgbkrk commented Jan 10, 2017

Thanks @parente


6 participants