Refactor `SourceKind` to store file content #6640

MichaReiser · 2023-08-17T07:51:46Z

Summary

This PR changes two things. You may want to take a look at the individual commits. I'm not entirely convinced whether we should land this change and it certainly needs more testing:

It changes Diagnostics to store the notebooks rather than the SourceKind. The sole reason is that we don't need the kind. We only need the notebook index to resolve cells. Ideally, this would be handled by a custom diagnostic kind, but we aren't there yet
I changed SourceKind to store the content for both Notebooks and regular Python files. This allows us to align lint_fix to never return the input in place, and instead return the updated result as a Cow<'a, SourceKind>. I find this easier to reason about than having to remember that notebooks are updated in place, but regular source code is not. However, this has the downside that we now need to clone the Notebooks, which may be rather expensive (multiple large vecs).

I'm looking for general feedback. Also happy if someone with more understanding on jupyter notebooks wants to take this over.

Test Plan

MichaReiser · 2023-08-17T07:51:57Z

Current dependencies on/for this PR:

main
- PR Avoid cloning source code multiple times #6629
  - PR Refactor SourceKind to store file content #6640 👈

This comment was auto-generated by Graphite.

github-actions · 2023-08-17T08:26:17Z

PR Check Results

Ecosystem

✅ ecosystem check detected no changes.

Benchmark

Linux

group                                      main                                   pr
-----                                      ----                                   --
formatter/large/dataset.py                 1.01      3.3±0.01ms    12.2 MB/sec    1.00      3.3±0.01ms    12.3 MB/sec
formatter/numpy/ctypeslib.py               1.01    677.9±9.42µs    24.6 MB/sec    1.00    673.6±8.83µs    24.7 MB/sec
formatter/numpy/globals.py                 1.00     70.4±0.33µs    41.9 MB/sec    1.00     70.2±0.66µs    42.0 MB/sec
formatter/pydantic/types.py                1.02  1360.9±14.58µs    18.7 MB/sec    1.00  1335.4±18.96µs    19.1 MB/sec
linter/all-rules/large/dataset.py          1.00     10.7±0.05ms     3.8 MB/sec    1.00     10.7±0.04ms     3.8 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.00      2.9±0.01ms     5.8 MB/sec    1.00      2.9±0.01ms     5.7 MB/sec
linter/all-rules/numpy/globals.py          1.00    323.2±2.29µs     9.1 MB/sec    1.00    324.4±1.02µs     9.1 MB/sec
linter/all-rules/pydantic/types.py         1.00      5.5±0.02ms     4.6 MB/sec    1.01      5.5±0.02ms     4.6 MB/sec
linter/default-rules/large/dataset.py      1.01      5.6±0.02ms     7.3 MB/sec    1.00      5.5±0.02ms     7.3 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.00   1195.4±7.92µs    13.9 MB/sec    1.00   1194.8±6.09µs    13.9 MB/sec
linter/default-rules/numpy/globals.py      1.00    125.5±0.52µs    23.5 MB/sec    1.00    125.5±0.76µs    23.5 MB/sec
linter/default-rules/pydantic/types.py     1.00      2.5±0.01ms    10.1 MB/sec    1.00      2.5±0.02ms    10.1 MB/sec

Windows

group                                      main                                   pr
-----                                      ----                                   --
formatter/large/dataset.py                 1.00      4.8±0.22ms     8.5 MB/sec    1.05      5.0±0.29ms     8.2 MB/sec
formatter/numpy/ctypeslib.py               1.00   969.8±39.07µs    17.2 MB/sec    1.00   970.1±38.45µs    17.2 MB/sec
formatter/numpy/globals.py                 1.00     95.0±4.15µs    31.1 MB/sec    1.04     99.1±7.46µs    29.8 MB/sec
formatter/pydantic/types.py                1.00  1967.0±88.09µs    13.0 MB/sec    1.02  1997.4±114.85µs    12.8 MB/sec
linter/all-rules/large/dataset.py          1.00     17.8±0.48ms     2.3 MB/sec    1.03     18.4±0.59ms     2.2 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.01      5.0±0.18ms     3.3 MB/sec    1.00      5.0±0.32ms     3.3 MB/sec
linter/all-rules/numpy/globals.py          1.02   624.8±48.44µs     4.7 MB/sec    1.00   614.5±29.54µs     4.8 MB/sec
linter/all-rules/pydantic/types.py         1.00      9.2±0.30ms     2.8 MB/sec    1.05      9.7±0.37ms     2.6 MB/sec
linter/default-rules/large/dataset.py      1.00      9.9±0.33ms     4.1 MB/sec    1.05     10.4±0.33ms     3.9 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.00      2.1±0.08ms     7.9 MB/sec    1.02      2.1±0.13ms     7.8 MB/sec
linter/default-rules/numpy/globals.py      1.00   254.0±15.06µs    11.6 MB/sec    1.04   263.9±13.66µs    11.2 MB/sec
linter/default-rules/pydantic/types.py     1.00      4.4±0.20ms     5.8 MB/sec    1.04      4.6±0.21ms     5.6 MB/sec

MichaReiser · 2023-08-17T08:48:18Z

crates/ruff/src/source_kind.rs

 use crate::jupyter::Notebook;

 #[derive(Clone, Debug, PartialEq, is_macro::Is)]
 pub enum SourceKind {
-    Python,
+    Python(String),


It would be nice to store the PySourceType on the SourceKind as well (making it a Program or SourceFile entity). However, it is a bit awkward because PySourceType has a Jupyter variant, but that variant should only be used to Jupyter...

MichaReiser · 2023-08-17T08:49:35Z

crates/ruff/src/linter.rs

@@ -63,7 +63,7 @@ pub struct FixerResult<'a> {
    /// The result returned by the linter, after applying any fixes.
    pub result: LinterResult<(Vec<Message>, Option<ImportMap>)>,
    /// The resulting source code, after applying any fixes.
-    pub transformed: Cow<'a, str>,
+    pub transformed: Cow<'a, SourceKind>,


The change here is that the linter now returns the transformed SourceKind which aligns the handling of Notebooks with regular python files in the sense that it doesn't mutate the source file in place.

MichaReiser · 2023-08-17T08:50:01Z

crates/ruff/src/linter.rs

@@ -374,7 +374,7 @@ pub fn lint_only(
        &directives,
        settings,
        noqa,
-        source_kind,
+        Some(source_kind),


Not sure why this is optional, but wanted to avoid increasing the scope of this even more.

The reason this is optional is because there's no --add-noqa support for Jupyter Notebooks yet. The logic for add_noqa_to_path basically calls into check_path and as far as I recall there were some circular dependency problems. Maybe it could be solved now with the recent changes.

Could add_noqa_to_path detect the PySourceType and exit early (with a warning) if it is a Jupyter notebook to avoid that all downstream code has to handle optional source kinds?

MichaReiser · 2023-08-17T08:51:02Z

crates/ruff/src/source_kind.rs

+    pub(crate) fn updated(&self, new_source: String, source_map: &SourceMap) -> Self {
+        match self {
+            SourceKind::Jupyter(notebook) => {
+                let mut cloned = notebook.clone();


This for sure is more expensive than the old "update in place".

Yeah, my main reason to update in place was to avoid cloning the entire notebook. Actually, the only thing we need to update with each linting cycle are the cell markers and the transformed source code can be looped over to the next iteration cycle. The cell content can be updated at the end before we actually update the Notebook that too only when --fix is passed.

Using the above idea, we could create a NotebookView which would only contain the necessary fields (for now I think it's cell markers only and the concatenated source code?) and should be much cheaper to clone. We would not even clone it and use an updated method on the view with the SourceMap and transformed source code and create a new view with the updated source code and cell markers. Once the linter iteration is complete, we can apply the view onto the Notebook, update the cell content then and write the Notebook back to the disk.

I think the abstraction would be to have a view into the source code be it Python or Jupyter and the view would have updated method to update the view. At the end the view will be applied to the concrete type and saved onto the disk.

This is just a rough idea based on my morning walk :)

I don't think I'm able to follow. Do you want to take a stab at this together with performing the renaming that you suggested? Or is this something that needs changing before landing this PR?

I tried to change raw to an Rc to get cheap cloning. However, it turns out that update_cell_content mutates the raw cells. The only larger structure that doesn't seem to get updated is valid_code_cells.

I think we could also move out index from the Notebook because it is only used in Diagnostics and instead have a JupyterIndex that is either Lazy(Notebook) or Built(Index).

Do you want to take a stab at this together with performing the renaming that you suggested?

Yeah, I can give it a go next week.

Or is this something that needs changing before landing this PR?

No

The only larger structure that doesn't seem to get updated is valid_code_cells.

What do you mean here by "larger structure"? As valid_code_cells is just a vector of u32.

However, it turns out that update_cell_content mutates the raw cells.

Yeah, this update isn't really required for every linter loop. We can avoid doing that and only update it at the end. This basically uses the concatenated source code and cell markers to update the cell content.

I think we could also move out index from the Notebook because it is only used in Diagnostics and instead have a JupyterIndex that is either Lazy(Notebook) or Built(Index).

Yeah, this is a good idea.

What do you mean here by "larger structure"? As valid_code_cells is just a vector of u32.

With large, I mean any non-fixed size heap allocated data structure that requires a deep clone.

charliermarsh · 2023-08-17T14:38:33Z

crates/ruff_cli/src/diagnostics.rs

@@ -64,7 +64,7 @@ pub(crate) struct Diagnostics {
    pub(crate) messages: Vec<Message>,
    pub(crate) fixed: FxHashMap<String, FixTable>,
    pub(crate) imports: ImportMap,
-    pub(crate) source_kind: FxHashMap<String, SourceKind>,
+    pub(crate) notebooks: FxHashMap<String, Notebook>,


This looks reasonable. So this allows us to keep the string content on SourceKind::Python, since we no longer store the Python source kind on here anyway.

charliermarsh

I think I'm supportive of this. It's really nice to avoid the &mut on the source kind.

MichaReiser · 2023-08-17T18:57:41Z

@dhruvmanila any opinion on this? Any recommendations on how to test the changes?

dhruvmanila

Thanks for making these changes :)

In the near future we should move towards finalizing an abstraction layer between different sources and the engines (linter, formatter, etc.).

crates/ruff/src/message/mod.rs

dhruvmanila · 2023-08-18T04:48:00Z

crates/ruff/src/source_kind.rs

+    pub(crate) fn updated(&self, new_source: String, source_map: &SourceMap) -> Self {
+        match self {
+            SourceKind::Jupyter(notebook) => {
+                let mut cloned = notebook.clone();


Yeah, my main reason to update in place was to avoid cloning the entire notebook. Actually, the only thing we need to update with each linting cycle are the cell markers and the transformed source code can be looped over to the next iteration cycle. The cell content can be updated at the end before we actually update the Notebook that too only when --fix is passed.

Using the above idea, we could create a NotebookView which would only contain the necessary fields (for now I think it's cell markers only and the concatenated source code?) and should be much cheaper to clone. We would not even clone it and use an updated method on the view with the SourceMap and transformed source code and create a new view with the updated source code and cell markers. Once the linter iteration is complete, we can apply the view onto the Notebook, update the cell content then and write the Notebook back to the disk.

I think the abstraction would be to have a view into the source code be it Python or Jupyter and the view would have updated method to update the view. At the end the view will be applied to the concrete type and saved onto the disk.

This is just a rough idea based on my morning walk :)

dhruvmanila · 2023-08-18T04:48:25Z

crates/ruff/src/test.rs

+    Ok(test_contents(&SourceKind::Python(contents), &path, settings).0)
+}
+
+pub(crate) struct TestedNotebook {


This is good. Thanks for doing this :)

dhruvmanila · 2023-08-18T05:11:04Z

Any recommendations on how to test the changes?

My main concern would be to test the linter loop (> 2 iterations) with --fix. The way we could do this is to use a source code where some fixes overlap so that then the other fixes gets applied in the next iteration allowing us to make sure the updates are taking place correctly. I gave a quick test and it's good. There are 4 iteration loops in the following test case. I'll add this test case in another PR.

Cells

import random
import math

try:
    print()
except ValueError:
    pass

import random
import pprint

random.randint(10, 20)

foo = 1
if foo is 2:
    raise ValueError(f"Invalid foo: {foo is 1}")

Command:

cargo run --bin ruff --package ruff_cli -- check --no-cache --isolated --select=I,F,SIM /path/to/test.ipynb --fix --diff

MichaReiser · 2023-08-18T08:31:10Z

Thanks @dhruvmanila for the extensive test plan. I played through the commands (showing diagnostics, diff, and fixing) and it works as expected

MichaReiser mentioned this pull request Aug 17, 2023

Avoid cloning source code multiple times #6629

Merged

MichaReiser commented Aug 17, 2023

View reviewed changes

MichaReiser requested review from charliermarsh and dhruvmanila August 17, 2023 08:53

MichaReiser marked this pull request as ready for review August 17, 2023 08:58

charliermarsh reviewed Aug 17, 2023

View reviewed changes

charliermarsh approved these changes Aug 17, 2023

View reviewed changes

MichaReiser added the internal An internal refactor or improvement label Aug 17, 2023

dhruvmanila approved these changes Aug 18, 2023

View reviewed changes

MichaReiser force-pushed the store-notebooks-in-diagnostics branch from 77ca065 to 83eab8e Compare August 18, 2023 08:35

MichaReiser changed the title ~~Store notebooks on Diagnostics~~ Refactor SourceKind to store file content Aug 18, 2023

MichaReiser force-pushed the store-notebooks-in-diagnostics branch 2 times, most recently from fbdec00 to a928045 Compare August 18, 2023 09:28

Base automatically changed from charlie/cow to main August 18, 2023 13:32

MichaReiser added 3 commits August 18, 2023 15:32

Store notebooks on Diagnostics

23a79ef

Add source to SourceKind

7c53255

Review feedback

49904dc

MichaReiser force-pushed the store-notebooks-in-diagnostics branch from a928045 to 49904dc Compare August 18, 2023 13:33

MichaReiser enabled auto-merge (squash) August 18, 2023 13:33

MichaReiser merged commit ea72d5f into main Aug 18, 2023
16 checks passed

MichaReiser deleted the store-notebooks-in-diagnostics branch August 18, 2023 13:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor `SourceKind` to store file content #6640

Refactor `SourceKind` to store file content #6640

MichaReiser commented Aug 17, 2023 •

edited

Loading

MichaReiser commented Aug 17, 2023

github-actions bot commented Aug 17, 2023 •

edited

Loading

MichaReiser Aug 17, 2023

MichaReiser Aug 17, 2023

MichaReiser Aug 17, 2023

dhruvmanila Aug 17, 2023

MichaReiser Aug 18, 2023

MichaReiser Aug 17, 2023

dhruvmanila Aug 18, 2023

MichaReiser Aug 18, 2023 •

edited

Loading

dhruvmanila Aug 18, 2023

MichaReiser Aug 18, 2023

charliermarsh Aug 17, 2023

charliermarsh left a comment

MichaReiser commented Aug 17, 2023 •

edited

Loading

dhruvmanila left a comment

dhruvmanila Aug 18, 2023

dhruvmanila Aug 18, 2023

dhruvmanila commented Aug 18, 2023

MichaReiser commented Aug 18, 2023

Refactor SourceKind to store file content #6640

Refactor SourceKind to store file content #6640

Conversation

MichaReiser commented Aug 17, 2023 • edited Loading

Summary

Test Plan

MichaReiser commented Aug 17, 2023

github-actions bot commented Aug 17, 2023 • edited Loading

PR Check Results

Ecosystem

Benchmark

Linux

Windows

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MichaReiser Aug 18, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

charliermarsh left a comment

Choose a reason for hiding this comment

MichaReiser commented Aug 17, 2023 • edited Loading

dhruvmanila left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dhruvmanila commented Aug 18, 2023

Cells

MichaReiser commented Aug 18, 2023

Refactor `SourceKind` to store file content #6640

Refactor `SourceKind` to store file content #6640

MichaReiser commented Aug 17, 2023 •

edited

Loading

github-actions bot commented Aug 17, 2023 •

edited

Loading

MichaReiser Aug 18, 2023 •

edited

Loading

MichaReiser commented Aug 17, 2023 •

edited

Loading