Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #3314 duplicate code error only shows up with pylint jobs 1 #3458

Conversation

doublethefish
Copy link
Contributor

Steps

  • Add yourself to CONTRIBUTORS if you are a new contributor.
  • Add a ChangeLog entry describing what your PR does.
  • If it's a new feature or an important bug fix, add a What's New entry in doc/whatsnew/<current release.rst>.
  • Write a good description on what the PR does.

Description

This PR allows recombining checker-data across checker instances, such as SimilarChecker, fixing #3314.

This means checkers that need to do analysis across all files can do so, even in a multiprocessing environment. We achieve this through a map/reduce mixin solution.

There are two key changes:

  1. check_parallel collates map data from each checker that supports it, and passes the collated data to a reduce classmember function on the associated checker
  2. MapReduceMixin is an opt-in object that specifies the interface that is looked for by check_parallel

One other set of changes are minimised changes to SimilarChecker and its tests. The goal here was to reduce the size of the patch and reuse as much existing code as possible - that said the code could do with modernising before further changes.

The other mentionable change is new set of tests for check_parallel. Looking at most of the other issues for job-related problems they seem to be configuration and cl-argument related e.g. #374 and #2674, it's not clear to me how to put some of those issues under test, yet - any help appreciated here.

To summarise: we previously had 'map' being invoked, here we add 'reduce' support by collecting the map-data and then passing it to reducer function, if available on the checker object - determined by whether they confirm to the MapReduceMixin mixing interface or nor.

On a personal note, I suspect that the memory implications of this might cause issues for large projects, perhaps even having implications on performance (see #2525), but let's get multiprocess/--jobs working properly first and deal with memory and performance considerations later - I say this as there are many quick wins we can make here, e.g. hashing lines, split work by available procs rather than by file, and so on.

Type of Changes

Type
🐛 Bug fix

Related Issue

Closes #3314

@doublethefish doublethefish force-pushed the bug/3314_duplicate_code_error_only_shows_up_with_pylint_jobs_1 branch 2 times, most recently from bc7725d to 02e055a Compare March 28, 2020 15:08
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.03%) to 90.245% when pulling 02e055a on doublethefish:bug/3314_duplicate_code_error_only_shows_up_with_pylint_jobs_1 into 53e5406 on PyCQA:master.

@coveralls
Copy link

coveralls commented Mar 28, 2020

Coverage Status

Coverage increased (+0.02%) to 90.856% when pulling cd21a1d on doublethefish:bug/3314_duplicate_code_error_only_shows_up_with_pylint_jobs_1 into fed4438 on PyCQA:master.

@doublethefish doublethefish force-pushed the bug/3314_duplicate_code_error_only_shows_up_with_pylint_jobs_1 branch 2 times, most recently from 2954120 to 9b71535 Compare April 1, 2020 00:13
pylint/lint.py Outdated Show resolved Hide resolved
@doublethefish
Copy link
Contributor Author

doublethefish commented Apr 3, 2020

I have performance analysis work that should go before this PR.
I have changed my mind on this. The performance work can be separated out.

@doublethefish doublethefish force-pushed the bug/3314_duplicate_code_error_only_shows_up_with_pylint_jobs_1 branch 2 times, most recently from bcfd6e7 to 474f2d0 Compare April 8, 2020 10:30
@doublethefish doublethefish changed the title Bug/3314 duplicate code error only shows up with pylint jobs 1 Fix #3314 duplicate code error only shows up with pylint jobs 1 Apr 8, 2020
@doublethefish doublethefish force-pushed the bug/3314_duplicate_code_error_only_shows_up_with_pylint_jobs_1 branch from 474f2d0 to 76168d7 Compare April 19, 2020 14:15
@Pierre-Sassoulas
Copy link
Member

I approve of the changes proposed, this is great work, probably huge for performance on multi core machines on top of fixing a bug. But I just can't review this much clever code in a reasonable time. Do you have an idea of something that could be separated easily? If some refactor would help to make the implementation less verbose, we could merge them before the main change.

Surface remark : you don't have to update the copyright manually yourself, there is some kind of script that does it.

@doublethefish
Copy link
Contributor Author

Do you have an idea of something that could be separated easily?

Yes. Absolutely. It will be the changes to lint.py and the get_map_data/reduce_map_data functions in similar.py. That's ~50/60 lines all in. Everything else is tests and work in support of that testing.

Surface remark : you don't have to update the copyright manually yourself, there is some kind of script that does it.

Gotcha :)

I approve of the changes proposed, this is great work, probably huge for performance on multi core machines on top of fixing a bug.

Unfortunately performance doesn't improve with this, neither with batching files (another PR I have ready, but dependent upon this PR), at least not in real-world use cases sadface.

Performance is a huge topic for pylint, when I started looking at it I thought it might be an easier problem than it is. Issue #2525 has some good notes on this. For example, some rough benchmarking suggests batching-files in -jN mode improves the performance by ~13%.... but that is only in the pylint wrapper system. This is because the astroid checkers seems to spend the same amount of time for each sub-proc as a non-threaded run meaning wall-clock time remains constant for all -j[1.. N]! (as #2525 states).

Performance work will take some collaboration it seems.

@doublethefish doublethefish force-pushed the bug/3314_duplicate_code_error_only_shows_up_with_pylint_jobs_1 branch from 76168d7 to 9912b1f Compare April 19, 2020 16:26
@Pierre-Sassoulas
Copy link
Member

Pierre-Sassoulas commented Apr 19, 2020

Thank you for the detailed explanation. If that is not too much work could you create a small merge request with the changes to lint.py and the get_map_data/reduce_map_data functions in similar.py ? Maybe even the typos. We could get that out of the way first.

Performance work will take some collaboration it seems.

After looking at the ticket you linked it seems that there is a clear path to make performance better. I would not call it a low hanging fruit, but there's a fruit ! Wouldn't it make sense to do this ticket before so we can see the actual result of parallelization and batching?

By the way, how are you testing the changes in performances ? I launch a local analysis on pylint and on apache spark. Which is probably a mediocre test... I'm looking for a fast way to see if the perfs are getting better.

@doublethefish
Copy link
Contributor Author

doublethefish commented Apr 19, 2020

If that is not too much work could you create a small merge request with the changes to lint.py and the get_map_data/reduce_map_data functions in similar.py ?

I've updated this PR with the minimal changes. We can get the rest in after as separate commits.

Edit: ~100 lines of the PR is code changes. There are two input-files which are the majority of the rest of the lines

pylint/lint.py Outdated Show resolved Hide resolved
@doublethefish
Copy link
Contributor Author

doublethefish commented Apr 19, 2020

After looking at the ticket you linked it seems that there is a clear path to make performance better. I would not call it a low hanging fruit, but there's a fruit ! Wouldn't it make sense to do this ticket before so we can see the actual result of parallelization and batching?

Yes. Although I'm not 100% sure there's a clear path, yet :)

By the way, how are you testing the changes in performances ? I launch a local analysis on pylint and on apache spark. Which is probably a mediocre test... I'm looking for a fast way to see if the perfs are getting better.

Have a look at wip PR #3473. I've tried to clarify the full range of considerations and problems - it may still not be very clear what is going on there, but look at the description and the individual commits behind that PR, it should tell a story (I try to tell a story with each commit, especially on complex work). Where I say ~13% gain with batching, I determined that via pytest-benchmark on a couple of machines, and with a carefully designed perf-test, but it only shows part of the picture.

... for the other part, if you've not seen it yet, I generated this performance-heat map for a pylint run against numpy under -j1:
https://pylint.s3.amazonaws.com/doublethefish/pylint/output.svg

That perf-chart is pretty much what I see irrespective of how I run performance analysis, but I'm lacking a good understanding of some of what pylint is doing under the hood (ast-walking and so on).

@Pierre-Sassoulas
Copy link
Member

I've started by reviewing #3473. The perf chart is very good, can you explain how you generated it? In #3013 I was using time python3 -m cProfile -s tottime pylint/__main__.py pylint/, I suppose you're creating the chart based on a similar profile ?

@doublethefish
Copy link
Contributor Author

I suppose you're creating the chart based on a similar profile ?

Yes. In general it is really simple thanks to pytest-profiling. Just add the "-profile-svg" switch to the pytest run once that pytest plugin is installed. Because it's so simple (except for getting it on S3) we use travis to generate it for us. E.g.

python -Wi -m pytest --exitfirst --failed-first --benchmark-enable {toxinidir}/tests {posargs:} --profile-svg --benchmark-group-by="group" --benchmark-save-data --benchmark-autosave

You can see how simple it is from this commit on PR #3473 :)

As for the numpy bit, that's best explained by this commit on a branch in my pylint fork (it needs rebasing and tidying and is dependent on #3473). That branch:

  1. Adds a testenv:profile_against_numpy to tox
  2. In tox we clone numpy at @ master
  3. Tox invoke a custom pytest test run, filtered to a single test-file
  4. That test is disabled, in another commit, for all other tests so we do not get the overhead of running against numpy in day to day testing.
  5. The test then gets all numpy's python files and uses pylint.llint.Run() to run with default config, whilst under pytest's profile plugin (which is sweet to use it made me fly a couple of inches off the floor).
  6. The .svg and .prof output is then uploaded to s3 using travis and the artefacts framework in Build failure on Python 3.9 #3457

My explanation is almost larger than the code to do it, which is why I love python so much :D

Having written that, because of its value, it might be worth just doing a numpy-profile PR before discussing in place the profiling regression work. It's easy enough to do, but PyCQA would need to have an S3 account and so on.

@Pierre-Sassoulas
Copy link
Member

Thank you for the very detailed explanation. I understand now, I missed the pytest option that generates the diagram! I don't think pylint have an S3 account (@PCManticore probably knows better than me). But it would also be useful to be able to do it locally. Also maybe it's possible to curl travis's artifacts, logs, or something similar?

@doublethefish
Copy link
Contributor Author

Yep, being able to do that with logs and so on will be useful.

I have a PR that I'm getting ready to just do performance runs against external repos. Until the S3 thing is sorted we can just run it locally.

@doublethefish
Copy link
Contributor Author

See PR #3498

@Pierre-Sassoulas
Copy link
Member

Pierre-Sassoulas commented Apr 27, 2020

Sure thing, check this branch.
There is still a problem with a fixture 'benchmark' not found but I don't think it's caused by a bad rebase. Travis tests passed for my branch probably a problem with tox cache.

@doublethefish
Copy link
Contributor Author

@Pierre-Sassoulas Thank you. I've pulled that, rebased it against latest 2.5 and am testing.

@doublethefish doublethefish force-pushed the bug/3314_duplicate_code_error_only_shows_up_with_pylint_jobs_1 branch from bcb88df to 60a823a Compare May 11, 2020 19:02
@PCManticore PCManticore self-requested a review May 12, 2020 06:52
@Pierre-Sassoulas
Copy link
Member

I think we need to merge #3498 and #3474 before reviewing this.

@Pierre-Sassoulas Pierre-Sassoulas self-assigned this Sep 24, 2020
@brycepg brycepg reopened this Oct 19, 2020
@doublethefish doublethefish force-pushed the bug/3314_duplicate_code_error_only_shows_up_with_pylint_jobs_1 branch from 60a823a to 97bac19 Compare October 21, 2020 08:05
@doublethefish
Copy link
Contributor Author

Hopefully this can finally go in.

Before adding a new mixin this proves the concept works, adding tests as
examples of how this would work in the main linter.

The idea here is that, because `check_parallel()` uses a multiprocess
`map` function, that the natural follow on is to use a 'reduce`
paradigm. This should demonstrate that.
@doublethefish doublethefish reopened this Oct 27, 2020
@doublethefish doublethefish force-pushed the bug/3314_duplicate_code_error_only_shows_up_with_pylint_jobs_1 branch from 0e1cc77 to 08361eb Compare October 27, 2020 08:37
This integrate the map/reduce functionality into lint.check_process().

We previously had `map` being invoked, here we add `reduce` support.

We do this by collecting the map-data by worker and then passing it to a
reducer function on the Checker object, if available - determined by
whether they confirm to the `mapreduce_checker.MapReduceMixin` mixin
interface or nor.

This allows Checker objects to function across file-streams when using
multiprocessing/-j2+. For example SimilarChecker needs to be able to
compare data across all files.

The tests, that we also add here, check that a Checker instance returns
and reports expected data and errors, such as error-messages and stats -
at least in a exit-ok (0) situation.

On a personal note, as we are copying more data across process
boundaries, I suspect that the memory implications of this might cause
issues for large projects already running with -jN and duplicate code
detection on. That said, given that it takes a long time to perform
lints of large code bases that is an issue for the [near?] future and
likely to be part of the performance work. Either way but let's get it
working first and deal with memory and perforamnce considerations later
- I say this as there are many quick wins we can make here, e.g.
file-batching, hashing lines, data compression and so on.
@doublethefish doublethefish force-pushed the bug/3314_duplicate_code_error_only_shows_up_with_pylint_jobs_1 branch from 08361eb to cd21a1d Compare October 28, 2020 19:51
@Pierre-Sassoulas
Copy link
Member

@doublethefish sorry for the delay, busy year ! :) And thanks again for the merge request. I fixed the current conflict on https://github.com/Pierre-Sassoulas/pylint/tree/doublethefish-bug/3314_duplicate_code_error_only_shows_up_with_pylint_jobs_1, could you check that the last commit to fix the error that appeared make sense for you, and update the MR with it or authorize edit on your branch by maintainers :) ?

Except for this conflict, I think this merge request is now ready, what do you think ?

@FPtje
Copy link

FPtje commented Dec 18, 2020

There's a rebase of this PR upon master at https://github.com/FPtje/pylint/tree/bug/3314_duplicate_code_error_only_shows_up_with_pylint_jobs_1

I'm not familiar with the code, but I tried my best to fix things changed due to the rebase:

  • pylint/lint/parallel.py: fixed conflict by accepting both changes, looks like they don't interfere much with each other
  • CONTRIBUTORS.txt: fixed conflict by accepting both changes, it doesn't look like there is a sort order in this file.
  • One commit changes the expected code in test_parallel_execution. Both -j 1 and -j 2 give the same result. Looks like the file was changed to only have errors in it
  • Use GenericTestReporter instead of TestReporter in this commit

It looks like the tests/benchmark/test_baseline_benchmarks.py test throws an error, but that same error is also present in master, so I did not attribute it to this PR.
Edit: Looks like this test failed due to pytest-benchmark not being installed locally. My bad!

Feel free to use the above fork for this PR. I would love to see this merged into master 👍

@Pierre-Sassoulas
Copy link
Member

Merged in #4007, thank you for the great work @doublethefish !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants