Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Diffing and merging notebooks #8

Merged
merged 5 commits into from
Mar 8, 2016
Merged

Diffing and merging notebooks #8

merged 5 commits into from
Mar 8, 2016

Conversation

martinal
Copy link
Contributor

No description provided.

@minrk minrk changed the title Add JEP notebook-diff. Diffing and merging notebooks Dec 14, 2015
* Command line tools for interactive resolution of merge conflicts
* Make the merge tool git compatible
* Make a web gui for displaying notebook diffs
* Make a web gui for interactive resolution of merge conflicts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to make sure that the web gui is fully integrated with Jupyter. Otherwise it will be very difficult for this to gain adoption, especially on systems (remote JupyterHub) where users only have access to the system through Jupyter in the browser.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the most part, I think this should be standalone, but it should at least play well with the new UI. I would expect most cases to be triggered by command-line calls, with a web view that only lives for a short period of time, dedicated to one diff view. If we properly separate how to request a diff of two notebooks from how they are displayed, adding a UI to request a diff from the web shouldn't be difficult, but I also think it shouldn't be the priority.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should really have similar APIs for requesting diffs and merges from python, javascript and commandline. If the UI can be built as extensions to other JupyterLab work, that sounds great to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@minrk I don't quite understand how to use this pull request for discussion. Should I add text to the document?

@minrk
Copy link
Member

minrk commented Feb 11, 2016

@ellisonbg @fperez should we merge this enhancement proposal as "accepted"? We are working on the nbdime repo, and can iron out the details of implementation later, but I think the "we are going to do something about this" bar is cleared, yes?

@rgbkrk
Copy link
Member

rgbkrk commented Feb 11, 2016

I sure think we should merge this as is and keep iterating.

@ellisonbg
Copy link
Contributor

+1

On Thu, Feb 11, 2016 at 1:31 PM, Kyle Kelley notifications@github.com
wrote:

I sure think we should merge this as is and keep iterating.


Reply to this email directly or view it on GitHub
#8 (comment)
.

Brian E. Granger
Associate Professor of Physics and Data Science
Cal Poly State University, San Luis Obispo
@ellisonbg on Twitter and GitHub
bgranger@calpoly.edu and ellisonbg@gmail.com

reuse as much as possible from jupyter notebook. An initial
version of conflict resolution can be to output a notebook with
conflicts marked within cells, to be manually edited as a regular
jupyter notebook.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though below it says that merge of arbitrary output isn't in scope, it should clarify what it will produce in the scenario where the input doesn't change but the output does. We don't have tools to edit output, nor to present multiple outputs and delete them selectively. So this important edge case will need to be handled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do!

@fperez
Copy link
Member

fperez commented Feb 11, 2016

Yup @minrk , "we are doing something about this" :) But one key thing that seems missing to me (and I don't find it in the nbdime repo either) is a high-level description of the approach taken.

To me, the notebook diff problem seems separable in two phases: identification of changes to the 'container' by finding cell insertions, deletions, transpositions, moves, etc, and then within-cell diffing, that then becomes content/mimetype-specific. This is a more complex version of the problem of finding the location of changes in a linear pure-text file, and then applying the local diff itself.

That's the approach that seems natural to me, but if that's not the case, then what approach is being used should be described.

Second, I think at least mention of what happens to metadata should be made. We have a lot of metadata in the notebook, and there can be differences therein.

So, while I apologize for the delay, I'm -1 on merging this document until these high-level ideas are a bit better described. While I realize that the actual implementation in the nbdime repo is where the 'real work' happens, I think it's also important that the high-level description is reasonably complete.

It shouldn't be too difficult to add a bit of language to that effect, and the document in the long run will be much more valuable. We want our JEPs to serve a similar role to Python's PEPs, in that they provide reasonable stand-alone descriptions of the problem and the proposed solution, that one can read to understand the whole thing at a high-level without digging into the implementation itself.

@willingc
Copy link
Member

FYI for those that would like more info on Python's PEP process. It's detailed in PEP-0001 https://www.python.org/dev/peps/pep-0001/#what-belongs-in-a-successful-pep

There is also a shorter description of the PEP process in the Python Dev Guide https://docs.python.org/devguide/langchanges.html

Anecdotally, the Achille's heel of the PEP process is that it is a slow slog with long periods of inactivity before acceptance of a PEP. Given Jupyter's rapid pace of development, I wonder if the PEP model, which is more a gatekeeper process than rapid innovation, is the best analogy. I love the focus on openness and transparency for JEPs; I would find it sad to see development flow ebb, due to PEP process inefficiences.

Perhaps "concept" status (which would equate to 5 or so committers/community members thinking it is worth exploring) would be a reasonable step before "draft" status; the benefit would be a place for a document to be worked on in the open and versioned while giving people an opportunity to iterate on the "concept" doc toward "draft" status. 🐢 🐰 🏃

@minrk
Copy link
Member

minrk commented Feb 12, 2016

@willingc that's a good point, and I think one thing that helps the JEP process be perhaps less cumbersome to the project as a whole is that relatively few changes warrant a JEP (e.g. a totally new project, like this one). The vast majority of changes, even relatively large ones, can go through the regular PR process.

@fperez thanks for the feedback! It's AOK to add things to do here, there just weren't any todo items for a while, so I thought I'd ping for a merge. Perhaps I should have phrased it as "Should we merge, or are there more things people would like added?" We'll get right on addressing your points.

@martinal
Copy link
Contributor Author

@fperez Indeed, we have an algorithm for matching up cells like you describe. We'll write up a high level description for the jpep and document the currently implemented variant better. Probably last half of next week unless @minrk beats me to it.

@fperez
Copy link
Member

fperez commented Feb 12, 2016

Indeed, the Python PEP process can be too slow sometimes, but I also think that for key ideas it can be useful to stop and think carefully through the key implications. We don't need to set the bar that high, and can still allow for iteration/evolution, but I think it's also good practice to have the base document in a reasonably complete state of ideas before merge.

Thanks @minrk and @martinal for responding, I think with a few such changes, the overall description will be much improved and will help others understand the direction of these ideas. Nothing prevents the actual implementation from evolving further and even changing this initial description, of course.

Martin Sandve Alnæs added 2 commits February 23, 2016 14:55
* Provided a high-level description of the diff algorithm approach.
* Updated with the major changes to the diff format that nbdime already implements (mostly copy-paste from current nbdime docs).
* Many smaller updates.
@martinal
Copy link
Contributor Author

I've updated the JEP in some significant ways, recommend reading through the new version instead of looking at the diffs. Hope this covers the most important things. I believe the questions not covered here are best handled as specific issues towards the nbdime repository.

@fperez
Copy link
Member

fperez commented Feb 23, 2016

Thanks @martinal! Much better indeed.

The only thing I see missing is at least some acknowledgment of the metadata issue and some indication of what your strategy there would be. It doesn't need to be too detailed, and I'm sure it will evolve. But metadata is pervasive to the notebook, and will be the source of interesting questions and technical challenges, so at least these questions should be acknowledged here for the proposal to really cover the key points of the problem.

Once that is done, I'm happy merging. Thanks again!

@bollwyvl
Copy link

For the Big Ol Unstructured JSON Data Tree,
https://github.com/benjamine/jsondiffpatch is pretty compelling. Demo here:
http://benjamine.github.io/jsondiffpatch/demo/index.html

They, too, offer yet another transformation language:
https://github.com/benjamine/jsondiffpatch/blob/master/docs/deltas.md

Reminds me of a certain XKCD...

On Tue, Feb 23, 2016 at 2:18 PM Fernando Perez notifications@github.com
wrote:

Thanks @martinal https://github.com/martinal! Much better indeed.

The only thing I see missing is at least some acknowledgment of the
metadata issue and some indication of what your strategy there would be. It
doesn't need to be too detailed, and I'm sure it will evolve. But metadata
is pervasive to the notebook, and will be the source of interesting
questions and technical challenges, so at least these questions should be
acknowledged here for the proposal to really cover the key points of the
problem.

Once that is done, I'm happy merging. Thanks again!


Reply to this email directly or view it on GitHub
#8 (comment)
.

@martinal
Copy link
Contributor Author

@fperez I'm not quite sure what "the metadata issue" is. Basically the plan is to treat it as a json blob in general, for which diff and merge basically works fine using already functional generic tools. @minrk has mentioned there are cases where we can know more about what the metadata means, in which case we can add options to do validation, use left, use right, etc. For visualization of arbitrary metadata, the demo @bollwyvl links to looks pretty nice.

@bollwyvl I think I saw this project before and rejected it because the way they approach comparing items in array diffs is too limiting. Also note that their diff format is even more terse than my original format ;) We should make a note of checking if they do something smart though.

@fperez
Copy link
Member

fperez commented Feb 24, 2016

@martinal, it's possible e.g. to have metadata changes in cells that are otherwise identical. But our UI doesn't by default display this explicitly, and for most users this is "hidden state". So the system will need to surface these changes and handle them in parallel to input/output, which is what users probably think most of the time about. It's not so much a technical problem (I agree with you, they are just another JSON blob), more of a UI/UX/workflow one.

But you don't need much more, just a quick pointer of what your plan is there for now, so we flag the issue. The implementation details can emerge in development.

@martinal
Copy link
Contributor Author

@fperez so I see this as a two part problem.

One is computing the diff (the merge follows), where the idea so far has been to make source code the primary identifier for cell identity, output a secondary identifier, and metadata ignored at the moment. There is a framework in place for implementing a sequence of predicates that include more or less of the cell contents to determine if they're to be treated as 'the same cell'. This can be extended to include e.g. edit distance of metadata blobs at the stricter equality levels, while ignoring metadata at the lower strictness. At the least strict level any change to output and metadata is ignored as long as source is approximately equal. These predicates will need testing and experimenting to make the algorithm behave sensibly.

The other is the UX part. In the diff visualization there can be for example another box above, between or below cell source and output, showing metadata conflicts by showing the json tree, one line for each leaf key:value, aligned with corresponding keys between left/right view. For somewhat structured data I think a simple approach like this could be the best. Either the user can edit the json directly, and we validate it on the fly, or a UI is built to handle editing in a safer way.

minrk and others added 2 commits March 8, 2016 09:47
@minrk
Copy link
Member

minrk commented Mar 8, 2016

@fperez I added a note about diffing metadata. Anything left to do on this?

@fperez
Copy link
Member

fperez commented Mar 8, 2016

Great, thanks! Merging now.

fperez added a commit that referenced this pull request Mar 8, 2016
High-level description of a toolchain for diffing and merging of notebooks.  Currently prototype implementation exists as the nbdime project.
@fperez fperez merged commit 340e9d8 into jupyter:master Mar 8, 2016
@fperez
Copy link
Member

fperez commented Mar 8, 2016

Done, thanks everyone for patiently taking feedback into consideration.

@minrk
Copy link
Member

minrk commented Mar 9, 2016

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants