Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Diffing and merging notebooks #8

Merged
merged 5 commits into from
Mar 8, 2016
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
170 changes: 170 additions & 0 deletions notebook-diff/notebook-diff.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,170 @@
# Diffing and merging notebooks

## Problem

Diffing and merging notebooks is not properly handled by standard linebased diff and merge tools.

## Proposed Enhancement

* Make a package containing tools for diff and merge of notebooks
* Include a command line api
* Pretty-printing of diffs for command line display
* Command line tools for interactive resolution of merge conflicts
* Make the merge tool git compatible
* Make a web gui for displaying notebook diffs
* Make a web gui for interactive resolution of merge conflicts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to make sure that the web gui is fully integrated with Jupyter. Otherwise it will be very difficult for this to gain adoption, especially on systems (remote JupyterHub) where users only have access to the system through Jupyter in the browser.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the most part, I think this should be standalone, but it should at least play well with the new UI. I would expect most cases to be triggered by command-line calls, with a web view that only lives for a short period of time, dedicated to one diff view. If we properly separate how to request a diff of two notebooks from how they are displayed, adding a UI to request a diff from the web shouldn't be difficult, but I also think it shouldn't be the priority.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should really have similar APIs for requesting diffs and merges from python, javascript and commandline. If the UI can be built as extensions to other JupyterLab work, that sounds great to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@minrk I don't quite understand how to use this pull request for discussion. Should I add text to the document?

* Plugin framework for mime type specific diffing


## Detailed Explanation

Preliminary work resides in [nbdime](https://github.com/martinal/nbdime).

Fundamentally, we envision use cases mainly in the categories
of a merge command for version control integration, and
diff command for inspecting changes and automated regression
testing. At the core of it all is the diff algorithms, which
must handle not only text in source cells but also a number of
data formats based on mime types in output cells.


### Basic diffing use cases

* View difference between versions of a file
* View diff of sources only
* View diff of output cells (basic text diff of output cells, image diff with external tool)


### Version control use cases

Most commonly, cell source is the primary content,
and output can presumably be regenerated. Indeed, it
is not possible to guarantee that merged sources and
merged output is consistent or makes any kind of sense.

The main use case for the merge tool will be a git-compatible commandline merge tool:

nbmerge base.ipynb local.ipynb remote.ipynb merged.ipynb

and a web gui for conflict resolution. Ideally the web gui can
reuse as much as possible from jupyter notebook. An initial
version of conflict resolution can be to output a notebook with
conflicts marked within cells, to be manually edited as a regular
jupyter notebook.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though below it says that merge of arbitrary output isn't in scope, it should clarify what it will produce in the scenario where the input doesn't change but the output does. We don't have tools to edit output, nor to present multiple outputs and delete them selectively. So this important edge case will need to be handled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do!


Goals:

* Trouble free automatic merge when no merge conflicts occur
* Optional behaviour to drop conflicting output
* Easy to use interactive conflict resolution

Not planning (for now):

* Merge of arbitrary output cell content

Open questions:

* Is it important to track source lines moving between cells?

Should make a collection of tricky corner cases, and
run merge tools on test cases from e.g. git if possible.


### Regression testing use cases

* View difference of output cells after re-running cells


### Diff format

A preliminary diff format has been defined, where the diff result is itself a json object.
The details of this format is being refined. For examples of concrete diff
objects, see e.g. the test suite for patch.


#### Diff format for dicts (current)

A diff of two dicts is a list of diff entries:

key = string
entry = [action, key] | [action, key, argument]
diff = [entry0, entry1, ...]

A dict diff entry is a list of action and argument (except for deletion):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lovely stuff! would like to see nbviewer be able to take advantage of it: think git blame as an overlay in the notebook to see the history of a document.

more broadly, "changes to a notebook" seems like a data type that can be reused throughout the ecosystem. having it line up with the ongoing realtime sync work would be good. @fperez @Carreau

If these aren't related, or that decision hasn't been made yet... here's 2c: unless some algorithmic complexity requires it, not using an existing standard format, or at least de facto format, will make it harder to reuse for other pieces in the ecosystem. The ones that spring to mind are:

additional reuse cases

  • widgets could adopt a finer-than-key-level incremental update
[
  { "op": "replace", "path": "/big_data_frame/1234567/987654", "value": 42 }
]
  • much simpler than jupyter drive, a StreamingContentManager could "subscribe" to a notebook diff stream on a remote server that was actually just broadcast
    • nice for presentations :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We looked at the standard generic JSON diff tools, but there are cases where we can do better than those, since we know some things about the meaning of the structure, it's not on opaque JSON object. It's possible that this would be better implemented as extensions to an existing standard, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll be happy to add at least a conversion function from the internal patch format to JSON Patch, the conversion should be trivial. Then whether to use that format directly internally in nbdime is an implementation detail. We could possibly build on python-json-patch somehow, I'll look into that.

Created an issue on making a JSON Patch conversion tool:
https://github.com/martinal/nbdime/issues/15


* ["-", key]: delete value at key
* ["+", key, newvalue]: insert newvalue at key
* ["!", key, diff]: patch value at key with diff
* [":", key, newvalue]: replace value at key with newvalue


#### Diff format for dicts (alternative)

A diff of two dicts is itself a dict mapping string keys to diff entries:

key = string
entry = [action] | [action, argument]
diff = {key0: entry0, key1: entry1, ...}

A dict diff entry is a list of action and argument (except for deletion):

* ["-"]: delete value at key
* ["+", newvalue]: insert newvalue at key
* ["!", diff]: patch value at key with diff
* [":", newvalue]: replace value at key with newvalue


#### Diff format for sequences (list and string)

A diff of two sequences is an ordered list of diff entries:

index = integer
entry = [action, index] | [action, index, argument]
diff = [entry0, entry1, ...]

A sequence diff entry is a list of action, index and argument (except for deletion):

* ["-", index]: delete entry at index
* ["+", index, newvalue]: insert single newvalue before index
* ["--", index, n]: delete n entries starting at index
* ["++", index, newvalues]: insert sequence newvalues before index
* ["!", index, diff]: patch value at index with diff

Possible simplifications:

* Remove single-item "-", "+" and rename "--" and "++" to single-letter.
* OR remove "--" and "++" and stick with just single-item versions.


Note: The currently implemented sequence diff algorithm is
based on a brute force O(N^2) longest common subsequence (LCS)
algorithm, this will be rewritten in terms of a faster algorithm
such as Myers O(ND) LCS based diff algorithm, optionally
using Pythons difflib for use cases it can handle.
In particular difflib does not handle custom compare predicate,
which we need to e.g. identify almost equal cells within sequences
of cells in a notebook.


### Merge format

The merge process should return two things: The merge result and the conflicts.

A format for representing merge conflicts is work in progress.

Each transformation in the base->local and base->remote diffs must either
end up in the merge result or be recorded in the conflicts representation.


## Pros and Cons

Pros associated with this implementation include:
* Improved workflows when placing notebooks in version control systems
* Possibility to use notebooks for self-documenting regression tests

Cons associated with this implementation include:
* Vanilla git installs will not receive the improved behaviour

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if you are building command line tools, couldn't those work with out git around?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this will work fine for manual diffs of separate notebooks on the filesystem. This is meant to point out that the issue is not that you can't view diffs or do merges, it's that the real problem of git doing a bad job by default won't actually be improved. Merge conflicts / bad merges on GitHub, etc. cannot be addressed by a tool like this.


## Interested Contributors
@martinal @minrk