Add backfill functionality to check admin #7094 #7232

xmunoz · 2020-01-14T07:01:58Z

Add backfill task
Change lookup of checks to check_name instead of id
Load checks that are also in "evaluation" state

xmunoz · 2020-01-14T07:19:05Z

Failing on coverage, but I wanted to get input about my implementation of the backfill function. There are two ways that this could run:

1. Calling the run_task method directly, which causes the backfill task to be long-running, synchronously executing each run_check call.
2. Having the backfill task enqueue a bunch of run_check tasks.
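To make the two options concrete, here is a minimal sketch in plain Python. It is not the code in this PR: `run_check` is a stand-in, the `deque` stands in for the real task queue, and `backfill_sync`, `backfill_enqueue`, and `drain` are hypothetical names used only for illustration.

```python
from collections import deque
from typing import Deque, Iterable, List, Tuple


def run_check(check_name: str, obj_id: int) -> str:
    """Stand-in for the real run_check task: pretend to scan one object."""
    return f"{check_name}:{obj_id}"


def backfill_sync(check_name: str, obj_ids: Iterable[int]) -> List[str]:
    """Option 1: a single long-running task that calls run_check inline."""
    return [run_check(check_name, obj_id) for obj_id in obj_ids]


def backfill_enqueue(
    check_name: str, obj_ids: Iterable[int], queue: Deque[Tuple[str, int]]
) -> int:
    """Option 2: the backfill task only enqueues work and returns quickly."""
    count = 0
    for obj_id in obj_ids:
        queue.append((check_name, obj_id))  # one queue round trip per object
        count += 1
    return count


def drain(queue: Deque[Tuple[str, int]]) -> List[str]:
    """Stand-in for workers picking the enqueued run_check tasks up later."""
    results = []
    while queue:
        check_name, obj_id = queue.popleft()
        results.append(run_check(check_name, obj_id))
    return results
```

Both paths do the same per-object work; the difference is where the loop runs and how many queue operations it costs.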

I have the first implementation in this PR for a few reasons. On my local machine, there appears to be a huge overhead incurred by enqueueing thousands of tasks. To execute a backfill on 10,000 items took about 15 seconds with the first method, and about 5 minutes with the second. This overhead may go away in production. My other reasons for preferring the first method:

It's easy to kill a single, rogue long-running task, and more difficult to play task whack-a-mole.
Each run_check executes a few read queries and at least one write query against the database. This could potentially cause lock contention on the database, requiring task whack-a-mole...
There may be a concern about the rate at which certain types of checks run, e.g. ones that query external APIs. Backfill checks executing synchronously means we get rate-limiting for free!

Let me know what you think!

warehouse/admin/views/checks.py

warehouse/malware/tasks.py

- Log number of runs executed by backfill
- Perform basic validation on sample_rate input
- Clean up other testing logic.
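As an illustration of what "basic validation on sample_rate input" could look like, here is a hedged sketch. It assumes the value arrives as a string from a form and is meant to be an integer percentage between 1 and 100; both the accepted range and the `parse_sample_rate` name are assumptions, not taken from the PR.

```python
def parse_sample_rate(raw: str) -> int:
    """Parse a user-supplied sample rate, rejecting anything out of range.

    Assumption: the rate is an integer percentage in the range 1..100.
    """
    try:
        value = int(raw)
    except (TypeError, ValueError):
        raise ValueError("sample_rate must be an integer") from None
    if not 1 <= value <= 100:
        raise ValueError("sample_rate must be between 1 and 100")
    return value
```

A view using something like this would catch the `ValueError` and report it to the admin rather than enqueueing a backfill with a bad value.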

ewdurbin

Looks good overall, just one concern.

warehouse/admin/views/checks.py

ewdurbin · 2020-01-15T16:38:01Z

Sorry, I missed this comment, and the approach in the backfill task itself...

> Failing on coverage, but I wanted to get input about my implementation of the backfill function. There are two ways that this could run:
>
> 1. Calling the run_task method directly, which causes the backfill task to be long-running, synchronously executing each run_check call.
>
> 2. Having the backfill task enqueue a bunch of run_check tasks.
>
> I have the first implementation in this PR for a few reasons. On my local machine, there appears to be a huge overhead incurred by enqueueing thousands of tasks. To execute a backfill on 10,000 items took about 15 seconds with the first method, and about 5 minutes with the second. This overhead may go away in production. My other reasons for preferring the first method:

I think the primary reason why the first approach is significantly more performant is that we currently aren't doing anything in any of the checks. I suspect that if a check were performing more substantial work, it would likely be problematic.

The second approach is likely more feasible in the long run, though perhaps not via the web UI due to timeouts. I doubt the overhead would disappear in prod, as enqueueing tasks still requires network calls to the queue.
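For scale, the local timings quoted above can be turned into a rough per-item cost. This is only back-of-the-envelope arithmetic on the approximate figures from the quoted comment (10,000 items, about 15 seconds, about 5 minutes), not a measurement:

```python
ITEMS = 10_000
SYNC_SECONDS = 15          # "about 15 seconds" with the synchronous approach
ENQUEUE_SECONDS = 5 * 60   # "about 5 minutes" when enqueueing each run_check

# Average wall-clock cost per item, in milliseconds.
sync_ms_per_item = SYNC_SECONDS * 1000 / ITEMS
enqueue_ms_per_item = ENQUEUE_SECONDS * 1000 / ITEMS

# How many times slower the enqueueing run was overall.
slowdown = ENQUEUE_SECONDS / SYNC_SECONDS

print(sync_ms_per_item, enqueue_ms_per_item, slowdown)
```

That is roughly 1.5 ms per item synchronously versus 30 ms per item when enqueueing, a factor of about 20 on the local setup described.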

> It's easy to kill a single, rogue long-running task, and more difficult to play task whack-a-mole.
>
> Each run_check executes a few read queries and at least one write query against the database. This could potentially cause lock contention on the database, requiring task whack-a-mole...
>
> There may be a concern about the rate at which certain types of checks run, e.g. ones that query external APIs. Backfill checks executing synchronously means we get rate-limiting for free!

These are all reasonable benefits of the first approach, but it looks to me like a better approach would be to strip out the ability to do a full backfill from the web and make it a CLI entrypoint under the malware module. Perhaps this view should only allow kicking off evaluations with very conservative corpus sizes.

xmunoz · 2020-01-15T16:42:16Z

> The second approach is likely more feasible in the long run, though perhaps not via the web UI due to timeouts.

I don't understand the concern about the web UI. The admin simply enqueues the backfill task and returns a view to the user. The backfill execution happens asynchronously. I don't think timeouts are a concern.

ewdurbin · 2020-01-15T17:01:26Z

> I don't understand the concern about the web UI. The admin simply enqueues the backfill task and returns a view to the user. The backfill execution happens asynchronously. I don't think timeouts are a concern.

I got mixed up where the iteration was occurring. You're right.

xmunoz · 2020-01-15T17:23:38Z

After more discussion around scalability, we are removing the ability to run full backfills from the project scope, and limiting admin-triggered backfills to a fixed size for check evaluation purposes. Update pending.

- Set backfill size to a fixed number, not configurable via web ui.
- Backfill task enqueues run_check tasks
- Only retry if `check.run` fails, not if loading the check fails.
- Use exponential backoff for retries.
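The retry behaviour described in the last two bullets can be sketched in plain Python as follows. This is an illustration under assumptions, not the task code from the PR: the registry lookup stands in for loading a check, and `CheckNotFound`, `backoff_delay`, and `run_with_retries` are hypothetical names. A real task would normally hand retries to the task queue instead of sleeping in-process.

```python
import time
from typing import Callable, Dict


class CheckNotFound(Exception):
    """Hypothetical error raised when a check cannot be loaded."""


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff: base, 2*base, 4*base, ... capped at `cap` seconds."""
    return min(cap, base * (2 ** attempt))


def run_with_retries(
    registry: Dict[str, Callable[[int], int]],
    check_name: str,
    obj_id: int,
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    # Loading the check happens outside the retry loop: a missing check
    # will not appear on a later attempt, so it fails immediately.
    try:
        check = registry[check_name]
    except KeyError:
        raise CheckNotFound(check_name) from None

    # Only the check's run step is retried, with growing delays between tries.
    for attempt in range(max_retries + 1):
        try:
            return check(obj_id)
        except Exception:
            if attempt == max_retries:
                raise
            sleep(backoff_delay(attempt))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.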

warehouse/admin/templates/admin/malware/checks/detail.html

Co-Authored-By: Ernest W. Durbin III <ewdurbin@gmail.com>

* Add backfill functionality to check admin #7094
  - Add backfill task
  - Change lookup of checks to check_name instead of id
  - Load checks that are also in "evaluation" state
* Add unit tests for backfill.
  - Log number of runs executed by backfill
  - Perform basic validation on sample_rate input
  - Clean up other testing logic.
* Remove superfluous 'all()'
* Code review changes.
  - Set backfill size to a fixed number, not configurable via web ui.
  - Backfill task enqueues run_check tasks
  - Only retry if `check.run` fails, not if loading the check fails.
  - Use exponential backoff for retries.
* Update warehouse/admin/templates/admin/malware/checks/detail.html

Co-authored-by: Ernest W. Durbin III <ewdurbin@gmail.com>



Add backfill functionality to check admin #7094

42cffbf

- Add backfill task
- Change lookup of checks to check_name instead of id
- Load checks that are also in "evaluation" state

woodruffw reviewed Jan 14, 2020

warehouse/admin/views/checks.py

woodruffw reviewed Jan 14, 2020

warehouse/malware/tasks.py

xmunoz added 2 commits January 14, 2020 10:48

Add unit tests for backfill.

7887664

- Log number of runs executed by backfill
- Perform basic validation on sample_rate input
- Clean up other testing logic.

Remove superfluous 'all()'

55a955e

xmunoz requested a review from ewdurbin January 14, 2020 20:03

ewdurbin reviewed Jan 14, 2020

warehouse/admin/views/checks.py

Code review changes.

a2c762c

- Set backfill size to a fixed number, not configurable via web ui.
- Backfill task enqueues run_check tasks
- Only retry if `check.run` fails, not if loading the check fails.
- Use exponential backoff for retries.

ewdurbin reviewed Jan 16, 2020

warehouse/admin/templates/admin/malware/checks/detail.html

Update warehouse/admin/templates/admin/malware/checks/detail.html

f642d95

Co-Authored-By: Ernest W. Durbin III <ewdurbin@gmail.com>

ewdurbin merged commit e96749f into pypi:malware-detection Jan 16, 2020
