Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor SourceKind to store file content #6640

Merged
merged 3 commits into from
Aug 18, 2023

Conversation

MichaReiser
Copy link
Member

@MichaReiser MichaReiser commented Aug 17, 2023

Summary

This PR changes two things. You may want to take a look at the individual commits. I'm not entirely convinced whether we should land this change and it certainly needs more testing:

  1. It changes Diagnostics to store the notebooks rather than the SourceKind. The sole reason is that we don't need the kind. We only need the notebook index to resolve cells. Ideally, this would be handled by a custom diagnostic kind, but we aren't there yet
  2. I changed SourceKind to store the content for both Notebooks and regular Python files. This allows us to align lint_fix to never return the input in place, and instead return the updated result as a Cow<'a, SourceKind>. I find this easier to reason about than having to remember that notebooks are updated in place, but regular source code is not. However, this has the downside that we now need to clone the Notebooks, which may be rather expensive (multiple large vecs).

I'm looking for general feedback. Also happy if someone with more understanding on jupyter notebooks wants to take this over.

Test Plan

@MichaReiser
Copy link
Member Author

Current dependencies on/for this PR:

This comment was auto-generated by Graphite.

@github-actions
Copy link
Contributor

github-actions bot commented Aug 17, 2023

PR Check Results

Ecosystem

✅ ecosystem check detected no changes.

Benchmark

Linux

group                                      main                                   pr
-----                                      ----                                   --
formatter/large/dataset.py                 1.01      3.3±0.01ms    12.2 MB/sec    1.00      3.3±0.01ms    12.3 MB/sec
formatter/numpy/ctypeslib.py               1.01    677.9±9.42µs    24.6 MB/sec    1.00    673.6±8.83µs    24.7 MB/sec
formatter/numpy/globals.py                 1.00     70.4±0.33µs    41.9 MB/sec    1.00     70.2±0.66µs    42.0 MB/sec
formatter/pydantic/types.py                1.02  1360.9±14.58µs    18.7 MB/sec    1.00  1335.4±18.96µs    19.1 MB/sec
linter/all-rules/large/dataset.py          1.00     10.7±0.05ms     3.8 MB/sec    1.00     10.7±0.04ms     3.8 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.00      2.9±0.01ms     5.8 MB/sec    1.00      2.9±0.01ms     5.7 MB/sec
linter/all-rules/numpy/globals.py          1.00    323.2±2.29µs     9.1 MB/sec    1.00    324.4±1.02µs     9.1 MB/sec
linter/all-rules/pydantic/types.py         1.00      5.5±0.02ms     4.6 MB/sec    1.01      5.5±0.02ms     4.6 MB/sec
linter/default-rules/large/dataset.py      1.01      5.6±0.02ms     7.3 MB/sec    1.00      5.5±0.02ms     7.3 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.00   1195.4±7.92µs    13.9 MB/sec    1.00   1194.8±6.09µs    13.9 MB/sec
linter/default-rules/numpy/globals.py      1.00    125.5±0.52µs    23.5 MB/sec    1.00    125.5±0.76µs    23.5 MB/sec
linter/default-rules/pydantic/types.py     1.00      2.5±0.01ms    10.1 MB/sec    1.00      2.5±0.02ms    10.1 MB/sec

Windows

group                                      main                                   pr
-----                                      ----                                   --
formatter/large/dataset.py                 1.00      4.8±0.22ms     8.5 MB/sec    1.05      5.0±0.29ms     8.2 MB/sec
formatter/numpy/ctypeslib.py               1.00   969.8±39.07µs    17.2 MB/sec    1.00   970.1±38.45µs    17.2 MB/sec
formatter/numpy/globals.py                 1.00     95.0±4.15µs    31.1 MB/sec    1.04     99.1±7.46µs    29.8 MB/sec
formatter/pydantic/types.py                1.00  1967.0±88.09µs    13.0 MB/sec    1.02  1997.4±114.85µs    12.8 MB/sec
linter/all-rules/large/dataset.py          1.00     17.8±0.48ms     2.3 MB/sec    1.03     18.4±0.59ms     2.2 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.01      5.0±0.18ms     3.3 MB/sec    1.00      5.0±0.32ms     3.3 MB/sec
linter/all-rules/numpy/globals.py          1.02   624.8±48.44µs     4.7 MB/sec    1.00   614.5±29.54µs     4.8 MB/sec
linter/all-rules/pydantic/types.py         1.00      9.2±0.30ms     2.8 MB/sec    1.05      9.7±0.37ms     2.6 MB/sec
linter/default-rules/large/dataset.py      1.00      9.9±0.33ms     4.1 MB/sec    1.05     10.4±0.33ms     3.9 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.00      2.1±0.08ms     7.9 MB/sec    1.02      2.1±0.13ms     7.8 MB/sec
linter/default-rules/numpy/globals.py      1.00   254.0±15.06µs    11.6 MB/sec    1.04   263.9±13.66µs    11.2 MB/sec
linter/default-rules/pydantic/types.py     1.00      4.4±0.20ms     5.8 MB/sec    1.04      4.6±0.21ms     5.6 MB/sec

use crate::jupyter::Notebook;

#[derive(Clone, Debug, PartialEq, is_macro::Is)]
pub enum SourceKind {
Python,
Python(String),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to store the PySourceType on the SourceKind as well (making it a Program or SourceFile entity). However, it is a bit awkward because PySourceType has a Jupyter variant, but that variant should only be used to Jupyter...

@@ -63,7 +63,7 @@ pub struct FixerResult<'a> {
/// The result returned by the linter, after applying any fixes.
pub result: LinterResult<(Vec<Message>, Option<ImportMap>)>,
/// The resulting source code, after applying any fixes.
pub transformed: Cow<'a, str>,
pub transformed: Cow<'a, SourceKind>,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change here is that the linter now returns the transformed SourceKind which aligns the handling of Notebooks with regular python files in the sense that it doesn't mutate the source file in place.

@@ -374,7 +374,7 @@ pub fn lint_only(
&directives,
settings,
noqa,
source_kind,
Some(source_kind),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why this is optional, but wanted to avoid increasing the scope of this even more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason this is optional is because there's no --add-noqa support for Jupyter Notebooks yet. The logic for add_noqa_to_path basically calls into check_path and as far as I recall there were some circular dependency problems. Maybe it could be solved now with the recent changes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could add_noqa_to_path detect the PySourceType and exit early (with a warning) if it is a Jupyter notebook to avoid that all downstream code has to handle optional source kinds?

pub(crate) fn updated(&self, new_source: String, source_map: &SourceMap) -> Self {
match self {
SourceKind::Jupyter(notebook) => {
let mut cloned = notebook.clone();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This for sure is more expensive than the old "update in place".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, my main reason to update in place was to avoid cloning the entire notebook. Actually, the only thing we need to update with each linting cycle are the cell markers and the transformed source code can be looped over to the next iteration cycle. The cell content can be updated at the end before we actually update the Notebook that too only when --fix is passed.

Using the above idea, we could create a NotebookView which would only contain the necessary fields (for now I think it's cell markers only and the concatenated source code?) and should be much cheaper to clone. We would not even clone it and use an updated method on the view with the SourceMap and transformed source code and create a new view with the updated source code and cell markers. Once the linter iteration is complete, we can apply the view onto the Notebook, update the cell content then and write the Notebook back to the disk.

I think the abstraction would be to have a view into the source code be it Python or Jupyter and the view would have updated method to update the view. At the end the view will be applied to the concrete type and saved onto the disk.

This is just a rough idea based on my morning walk :)

Copy link
Member Author

@MichaReiser MichaReiser Aug 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I'm able to follow. Do you want to take a stab at this together with performing the renaming that you suggested? Or is this something that needs changing before landing this PR?

I tried to change raw to an Rc to get cheap cloning. However, it turns out that update_cell_content mutates the raw cells. The only larger structure that doesn't seem to get updated is valid_code_cells.

I think we could also move out index from the Notebook because it is only used in Diagnostics and instead have a JupyterIndex that is either Lazy(Notebook) or Built(Index).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to take a stab at this together with performing the renaming that you suggested?

Yeah, I can give it a go next week.

Or is this something that needs changing before landing this PR?

No

The only larger structure that doesn't seem to get updated is valid_code_cells.

What do you mean here by "larger structure"? As valid_code_cells is just a vector of u32.

However, it turns out that update_cell_content mutates the raw cells.

Yeah, this update isn't really required for every linter loop. We can avoid doing that and only update it at the end. This basically uses the concatenated source code and cell markers to update the cell content.

I think we could also move out index from the Notebook because it is only used in Diagnostics and instead have a JupyterIndex that is either Lazy(Notebook) or Built(Index).

Yeah, this is a good idea.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean here by "larger structure"? As valid_code_cells is just a vector of u32.

With large, I mean any non-fixed size heap allocated data structure that requires a deep clone.

@MichaReiser MichaReiser marked this pull request as ready for review August 17, 2023 08:58
@@ -64,7 +64,7 @@ pub(crate) struct Diagnostics {
pub(crate) messages: Vec<Message>,
pub(crate) fixed: FxHashMap<String, FixTable>,
pub(crate) imports: ImportMap,
pub(crate) source_kind: FxHashMap<String, SourceKind>,
pub(crate) notebooks: FxHashMap<String, Notebook>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks reasonable. So this allows us to keep the string content on SourceKind::Python, since we no longer store the Python source kind on here anyway.

Copy link
Member

@charliermarsh charliermarsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm supportive of this. It's really nice to avoid the &mut on the source kind.

@MichaReiser MichaReiser added the internal An internal refactor or improvement label Aug 17, 2023
@MichaReiser
Copy link
Member Author

MichaReiser commented Aug 17, 2023

@dhruvmanila any opinion on this? Any recommendations on how to test the changes?

Copy link
Member

@dhruvmanila dhruvmanila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making these changes :)

In the near future we should move towards finalizing an abstraction layer between different sources and the engines (linter, formatter, etc.).

crates/ruff/src/message/mod.rs Outdated Show resolved Hide resolved
pub(crate) fn updated(&self, new_source: String, source_map: &SourceMap) -> Self {
match self {
SourceKind::Jupyter(notebook) => {
let mut cloned = notebook.clone();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, my main reason to update in place was to avoid cloning the entire notebook. Actually, the only thing we need to update with each linting cycle are the cell markers and the transformed source code can be looped over to the next iteration cycle. The cell content can be updated at the end before we actually update the Notebook that too only when --fix is passed.

Using the above idea, we could create a NotebookView which would only contain the necessary fields (for now I think it's cell markers only and the concatenated source code?) and should be much cheaper to clone. We would not even clone it and use an updated method on the view with the SourceMap and transformed source code and create a new view with the updated source code and cell markers. Once the linter iteration is complete, we can apply the view onto the Notebook, update the cell content then and write the Notebook back to the disk.

I think the abstraction would be to have a view into the source code be it Python or Jupyter and the view would have updated method to update the view. At the end the view will be applied to the concrete type and saved onto the disk.

This is just a rough idea based on my morning walk :)

Ok(test_contents(&SourceKind::Python(contents), &path, settings).0)
}

pub(crate) struct TestedNotebook {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good. Thanks for doing this :)

@dhruvmanila
Copy link
Member

Any recommendations on how to test the changes?

My main concern would be to test the linter loop (> 2 iterations) with --fix. The way we could do this is to use a source code where some fixes overlap so that then the other fixes gets applied in the next iteration allowing us to make sure the updates are taking place correctly. I gave a quick test and it's good. There are 4 iteration loops in the following test case. I'll add this test case in another PR.

Cells

import random
import math
try:
    print()
except ValueError:
    pass
import random
import pprint

random.randint(10, 20)
foo = 1
if foo is 2:
    raise ValueError(f"Invalid foo: {foo is 1}")

Command:

cargo run --bin ruff --package ruff_cli -- check --no-cache --isolated --select=I,F,SIM /path/to/test.ipynb --fix --diff

@MichaReiser
Copy link
Member Author

Thanks @dhruvmanila for the extensive test plan. I played through the commands (showing diagnostics, diff, and fixing) and it works as expected

@MichaReiser MichaReiser force-pushed the store-notebooks-in-diagnostics branch from 77ca065 to 83eab8e Compare August 18, 2023 08:35
@MichaReiser MichaReiser changed the title Store notebooks on Diagnostics Refactor SourceKind to store file content Aug 18, 2023
@MichaReiser MichaReiser force-pushed the store-notebooks-in-diagnostics branch 2 times, most recently from fbdec00 to a928045 Compare August 18, 2023 09:28
Base automatically changed from charlie/cow to main August 18, 2023 13:32
@MichaReiser MichaReiser force-pushed the store-notebooks-in-diagnostics branch from a928045 to 49904dc Compare August 18, 2023 13:33
@MichaReiser MichaReiser enabled auto-merge (squash) August 18, 2023 13:33
@MichaReiser MichaReiser merged commit ea72d5f into main Aug 18, 2023
16 checks passed
@MichaReiser MichaReiser deleted the store-notebooks-in-diagnostics branch August 18, 2023 13:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
internal An internal refactor or improvement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants