-
Notifications
You must be signed in to change notification settings - Fork 65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Diffing and merging notebooks #8
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,170 @@ | ||
# Diffing and merging notebooks | ||
|
||
## Problem | ||
|
||
Diffing and merging notebooks is not properly handled by standard linebased diff and merge tools. | ||
|
||
## Proposed Enhancement | ||
|
||
* Make a package containing tools for diff and merge of notebooks | ||
* Include a command line api | ||
* Pretty-printing of diffs for command line display | ||
* Command line tools for interactive resolution of merge conflicts | ||
* Make the merge tool git compatible | ||
* Make a web gui for displaying notebook diffs | ||
* Make a web gui for interactive resolution of merge conflicts | ||
* Plugin framework for mime type specific diffing | ||
|
||
|
||
## Detailed Explanation | ||
|
||
Preliminary work resides in [nbdime](https://github.com/martinal/nbdime). | ||
|
||
Fundamentally, we envision use cases mainly in the categories | ||
of a merge command for version control integration, and | ||
diff command for inspecting changes and automated regression | ||
testing. At the core of it all is the diff algorithms, which | ||
must handle not only text in source cells but also a number of | ||
data formats based on mime types in output cells. | ||
|
||
|
||
### Basic diffing use cases | ||
|
||
* View difference between versions of a file | ||
* View diff of sources only | ||
* View diff of output cells (basic text diff of output cells, image diff with external tool) | ||
|
||
|
||
### Version control use cases | ||
|
||
Most commonly, cell source is the primary content, | ||
and output can presumably be regenerated. Indeed, it | ||
is not possible to guarantee that merged sources and | ||
merged output is consistent or makes any kind of sense. | ||
|
||
The main use case for the merge tool will be a git-compatible commandline merge tool: | ||
|
||
nbmerge base.ipynb local.ipynb remote.ipynb merged.ipynb | ||
|
||
and a web gui for conflict resolution. Ideally the web gui can | ||
reuse as much as possible from jupyter notebook. An initial | ||
version of conflict resolution can be to output a notebook with | ||
conflicts marked within cells, to be manually edited as a regular | ||
jupyter notebook. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Even though below it says that merge of arbitrary output isn't in scope, it should clarify what it will produce in the scenario where the input doesn't change but the output does. We don't have tools to edit output, nor to present multiple outputs and delete them selectively. So this important edge case will need to be handled. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Will do! |
||
|
||
Goals: | ||
|
||
* Trouble free automatic merge when no merge conflicts occur | ||
* Optional behaviour to drop conflicting output | ||
* Easy to use interactive conflict resolution | ||
|
||
Not planning (for now): | ||
|
||
* Merge of arbitrary output cell content | ||
|
||
Open questions: | ||
|
||
* Is it important to track source lines moving between cells? | ||
|
||
Should make a collection of tricky corner cases, and | ||
run merge tools on test cases from e.g. git if possible. | ||
|
||
|
||
### Regression testing use cases | ||
|
||
* View difference of output cells after re-running cells | ||
|
||
|
||
### Diff format | ||
|
||
A preliminary diff format has been defined, where the diff result is itself a json object. | ||
The details of this format is being refined. For examples of concrete diff | ||
objects, see e.g. the test suite for patch. | ||
|
||
|
||
#### Diff format for dicts (current) | ||
|
||
A diff of two dicts is a list of diff entries: | ||
|
||
key = string | ||
entry = [action, key] | [action, key, argument] | ||
diff = [entry0, entry1, ...] | ||
|
||
A dict diff entry is a list of action and argument (except for deletion): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. lovely stuff! would like to see nbviewer be able to take advantage of it: think more broadly, "changes to a notebook" seems like a data type that can be reused throughout the ecosystem. having it line up with the ongoing realtime sync work would be good. @fperez @Carreau If these aren't related, or that decision hasn't been made yet... here's 2c: unless some algorithmic complexity requires it, not using an existing standard format, or at least de facto format, will make it harder to reuse for other pieces in the ecosystem. The ones that spring to mind are:
additional reuse cases
[
{ "op": "replace", "path": "/big_data_frame/1234567/987654", "value": 42 }
]
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We looked at the standard generic JSON diff tools, but there are cases where we can do better than those, since we know some things about the meaning of the structure, it's not on opaque JSON object. It's possible that this would be better implemented as extensions to an existing standard, though. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'll be happy to add at least a conversion function from the internal patch format to JSON Patch, the conversion should be trivial. Then whether to use that format directly internally in nbdime is an implementation detail. We could possibly build on python-json-patch somehow, I'll look into that. Created an issue on making a JSON Patch conversion tool: |
||
|
||
* ["-", key]: delete value at key | ||
* ["+", key, newvalue]: insert newvalue at key | ||
* ["!", key, diff]: patch value at key with diff | ||
* [":", key, newvalue]: replace value at key with newvalue | ||
|
||
|
||
#### Diff format for dicts (alternative) | ||
|
||
A diff of two dicts is itself a dict mapping string keys to diff entries: | ||
|
||
key = string | ||
entry = [action] | [action, argument] | ||
diff = {key0: entry0, key1: entry1, ...} | ||
|
||
A dict diff entry is a list of action and argument (except for deletion): | ||
|
||
* ["-"]: delete value at key | ||
* ["+", newvalue]: insert newvalue at key | ||
* ["!", diff]: patch value at key with diff | ||
* [":", newvalue]: replace value at key with newvalue | ||
|
||
|
||
#### Diff format for sequences (list and string) | ||
|
||
A diff of two sequences is an ordered list of diff entries: | ||
|
||
index = integer | ||
entry = [action, index] | [action, index, argument] | ||
diff = [entry0, entry1, ...] | ||
|
||
A sequence diff entry is a list of action, index and argument (except for deletion): | ||
|
||
* ["-", index]: delete entry at index | ||
* ["+", index, newvalue]: insert single newvalue before index | ||
* ["--", index, n]: delete n entries starting at index | ||
* ["++", index, newvalues]: insert sequence newvalues before index | ||
* ["!", index, diff]: patch value at index with diff | ||
|
||
Possible simplifications: | ||
|
||
* Remove single-item "-", "+" and rename "--" and "++" to single-letter. | ||
* OR remove "--" and "++" and stick with just single-item versions. | ||
|
||
|
||
Note: The currently implemented sequence diff algorithm is | ||
based on a brute force O(N^2) longest common subsequence (LCS) | ||
algorithm, this will be rewritten in terms of a faster algorithm | ||
such as Myers O(ND) LCS based diff algorithm, optionally | ||
using Pythons difflib for use cases it can handle. | ||
In particular difflib does not handle custom compare predicate, | ||
which we need to e.g. identify almost equal cells within sequences | ||
of cells in a notebook. | ||
|
||
|
||
### Merge format | ||
|
||
The merge process should return two things: The merge result and the conflicts. | ||
|
||
A format for representing merge conflicts is work in progress. | ||
|
||
Each transformation in the base->local and base->remote diffs must either | ||
end up in the merge result or be recorded in the conflicts representation. | ||
|
||
|
||
## Pros and Cons | ||
|
||
Pros associated with this implementation include: | ||
* Improved workflows when placing notebooks in version control systems | ||
* Possibility to use notebooks for self-documenting regression tests | ||
|
||
Cons associated with this implementation include: | ||
* Vanilla git installs will not receive the improved behaviour | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. But if you are building command line tools, couldn't those work with out git around? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, this will work fine for manual diffs of separate notebooks on the filesystem. This is meant to point out that the issue is not that you can't view diffs or do merges, it's that the real problem of git doing a bad job by default won't actually be improved. Merge conflicts / bad merges on GitHub, etc. cannot be addressed by a tool like this. |
||
|
||
## Interested Contributors | ||
@martinal @minrk |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to make sure that the web gui is fully integrated with Jupyter. Otherwise it will be very difficult for this to gain adoption, especially on systems (remote JupyterHub) where users only have access to the system through Jupyter in the browser.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the most part, I think this should be standalone, but it should at least play well with the new UI. I would expect most cases to be triggered by command-line calls, with a web view that only lives for a short period of time, dedicated to one diff view. If we properly separate how to request a diff of two notebooks from how they are displayed, adding a UI to request a diff from the web shouldn't be difficult, but I also think it shouldn't be the priority.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should really have similar APIs for requesting diffs and merges from python, javascript and commandline. If the UI can be built as extensions to other JupyterLab work, that sounds great to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@minrk I don't quite understand how to use this pull request for discussion. Should I add text to the document?