[Task]: Improve generation to generate smaller MLBF files #15261

Open
Rob--W opened this issue Dec 25, 2024 · 3 comments
Labels
repository:addons-server Issue relating to addons-server

@Rob--W
Member

Rob--W commented Dec 25, 2024

Description

Recently, a new MLBF was published whose size increased by almost 100 KB in comparison to the previous MLBF. This was flagged as a performance regression in https://bugzilla.mozilla.org/show_bug.cgi?id=1938506.

EDIT: The analysis below was based on the assumption that the MLBF grew due to unexpected changes in the ratio. After checking the data in production, it seems more likely that there is a bug in the generation code that results in duplicate entries, which can also skew the ratio for no good reason. See #15261 (comment)

After taking a closer look, it looks like there is an actionable way towards reducing the size of the MLBF data. The full analysis is in https://bugzilla.mozilla.org/show_bug.cgi?id=1938506#c5.

In summary: to immediately reduce the size of the hard block MLBF in production, https://github.com/mozilla/addons-server/blob/0f718e347cde838085c9f8b2f5eec8fb45f125b4/src/olympia/blocklist/mlbf.py#L55-L57 can be changed as follows:

-        cascade.set_crlite_error_rates(
-            include_len=error_rates[0], exclude_len=error_rates[1]
-        )
+        # See https://bugzilla.mozilla.org/show_bug.cgi?id=1938506#c5
+        cascade.set_error_rates([0.25, 0.5])

To make sure that the improvement is a strict improvement (i.e. that the above change does not negatively affect other bloom filters for different inputs), one way is to run the MLBF generation logic multiple times and pick the smallest of the resulting filters, for example as sketched below.
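For illustration, a rough sketch of that approach (hypothetical helper, not existing addons-server code; build_filter stands in for the generation logic in mlbf.py with the error rates made configurable, and bitCount() is assumed as the size measure):

def smallest_mlbf(build_filter, candidate_error_rates):
    """Build one filter per candidate error-rate configuration and return the
    smallest one. build_filter takes an error_rates list (or None to keep the
    current set_crlite_error_rates behaviour) and returns a FilterCascade."""
    candidates = [build_filter(error_rates) for error_rates in candidate_error_rates]
    # bitCount() is used as the size measure here; the length of the
    # serialized filter file would work just as well as a comparison key.
    return min(candidates, key=lambda cascade: cascade.bitCount())

# Example: compare the current CRLite-derived rates against the fixed
# [0.25, 0.5] rates proposed above, and publish whichever filter is smaller:
# best = smallest_mlbf(build_filter, [None, [0.25, 0.5]])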

Acceptance Criteria

Milestones/checkpoints


Checks

  • If I have identified that the work is specific to a repository, I have removed "repository:addons-server" or "repository:addons-frontend"

┆Issue is synchronized with this Jira Task

@Rob--W Rob--W added needs:info repository:addons-server Issue relating to addons-server labels Dec 25, 2024
@KevinMind
Contributor

KevinMind commented Jan 6, 2025

If I understand your analysis correctly @Rob--W the root cause of this is a decrease in the ratio of hard blocked to total versions which forces the filter to create more and larger hashes to get accurate results. Essentially the haystack is growing and the needle is still a needle. This behavior is not related to changes made in the last few months to the MLBF logic or how we are setting the error rates.

I was looking at how we have changed the logic for setting the error rates and it looks like this has been consistent since at least July 2023 (probably longer) other than the fact that we do not set error rates if either the include or exclude list is empty as this causes errors in filter generation. That should not be relevant here.

Further, it seems like we could try to integrate the formula you are using to compute the new error rate into the code so we don't need to continuously update the constant. I need to read a bit deeper but I would assume if we simply hard coded this value we solve the problem now but it could resurface if the ratio of blocked to total versions changes significantly again.

Can you sense check this for me? Am I missing something obvious so far?

Final point, we have stress testing for the filter and could add filter size testing as well to ensure not only we create the filter in a reasonable time but that it meets a standard for size relative to the underlying data set. Maybe that could also be useful here.
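For example, such a size check could look roughly like this (purely illustrative, not existing code; the synthetic data, the generate_mlbf signature, and the 32-bits-per-blocked-version budget are assumptions that would need calibrating against known-good production filters):

from olympia.blocklist.mlbf import generate_mlbf


def make_fake_versions(blocked, not_blocked):
    # Synthetic "guid:version"-style entries, only meant to exercise the
    # filter size, not to resemble real add-on data.
    include = [f'blocked-{i}:1.0' for i in range(blocked)]
    exclude = [f'ok-{i}:1.0' for i in range(not_blocked)]
    return include, exclude


def test_mlbf_size_stays_reasonable():
    include, exclude = make_fake_versions(blocked=10_000, not_blocked=290_000)
    cascade = generate_mlbf(stats={}, include=include, exclude=exclude)
    # Placeholder budget of 32 bits per blocked version; the real bound would
    # be derived from the sizes of known-good production filters.
    assert cascade.bitCount() <= 32 * len(include)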

@Rob--W
Member Author

Rob--W commented Jan 6, 2025

If I understand your analysis correctly @Rob--W the root cause of this is a decrease in the ratio of hard blocked to total versions which forces the filter to create more and larger hashes to get accurate results.

To definitively confirm this, we need to know the actual number of hard blocks vs non-hard blocks. With that, the generation process can be simulated again to estimate the numbers. Due to the construction, unless the inputs are exactly the same, the result may be slightly off, but I think it is close enough to be meaningful.

Essentially the haystack is growing and the needle is still a needle. This behavior is not related to changes made in the last few months to the MLBF logic or how we are setting the error rates.

Likely independent of those changes, but before discarding this possibility I'd like to verify/confirm the numbers. E.g. if refactoring resulted in too many items being considered "blocked", that could also have skewed the ratio enough to trigger this issue.

I was looking at how we have changed the logic for setting the error rates and it looks like this has been consistent since at least July 2023 (probably longer) other than the fact that we do not set error rates if either the include or exclude list is empty as this causes errors in filter generation. That should not be relevant here.

I traced the history, and the logic has not changed significantly since 2020. The last change/refactor that touched this logic was mozilla/addons-server#14108.

Further, it seems like we could try to integrate the formula you are using to compute the new error rate into the code so we don't need to continuously update the constant. I need to read a bit deeper but I would assume if we simply hard coded this value we solve the problem now but it could resurface if the ratio of blocked to total versions changes significantly again.

I showed multiple formulas in https://bugzilla.mozilla.org/show_bug.cgi?id=1938506#c5, so it is not clear which one you are suggesting to include. The remainder of my comment refers to functions and terminology from that Bugzilla comment, so if anything is unclear, please Ctrl+F in the Bugzilla comment.

set_crlite_error_rates and calc_n_hashes are directly relevant to the size in this case. I also mentioned calc_size without further analysis, but that can also affect the size of the result.

Can you sense check this for me? Am I missing something obvious so far?

The error_rates option (which specifies the falsePositiveRate) can be set directly to avoid having to work around set_crlite_error_rates. If the number of hashes is the primary cause of the size regression, a potential approach is to generate multiple error_rates combinations such that calc_n_hashes returns a suggestion close to the current estimate. E.g. if the current nhashes is 3, we try an error rate such that nhashes would be computed as 2. Or we could simply iterate nhashes from 1 to (e.g.) 3.
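For context, this uses the standard Bloom filter relation between the per-layer false-positive rate p and the number of hash functions, k ≈ log2(1/p). A small sketch of that relation (plain Bloom filter math, not the actual calc_n_hashes implementation):

import math


def hashes_for_error_rate(error_rate):
    # Standard Bloom filter relation: k ≈ log2(1 / p), with at least one hash.
    return max(1, math.ceil(math.log2(1 / error_rate)))


def error_rate_for_hashes(n_hashes):
    # Inverse of the relation above: 2 ** -k is a rate at which k hashes suffice.
    return 2.0 ** -n_hashes


# E.g. to bring a layer from 3 hashes down to 2, an error rate of 0.25 is
# enough, which lines up with the [0.25, 0.5] rates proposed in the description.
assert hashes_for_error_rate(0.25) == 2
assert error_rate_for_hashes(2) == 0.25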

Final point, we have stress testing for the filter and could add filter size testing as well to ensure not only we create the filter in a reasonable time but that it meets a standard for size relative to the underlying data set. Maybe that could also be useful here.

SGTM.

FYI: I have Python scripts to simulate generation and to analyze MLBF file data. Let me know if you are interested in either, and I'll upload them somewhere.

@Rob--W
Member Author

Rob--W commented Jan 11, 2025

I received the actual inputs that were used as part of the generation (cache.json as written by the MLBF loader).

From the public MLBF I extracted the hash, and combined this with the data in cache.json (include = blocked, exclude = not_blocked + soft_blocked) to generate a new MLBF.
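In code terms, the reconstruction described above amounts to roughly the following (the cache.json key names are assumed to match the terms used here):

import json

with open('cache.json') as f:
    cache = json.load(f)

# include/exclude as described above: blocked versions on one side, everything
# else (not_blocked plus soft_blocked) on the other.
include = cache['blocked']
exclude = cache['not_blocked'] + cache['soft_blocked']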

With the 7 Nov 2024 data, my generated MLBF is identical to the one in prod; the MLBF is 845025 bytes.

But with the 17 Dec 2024 data, my generated MLBF is 847859 bytes (which looks very reasonable and is what I expect), while the actual MLBF in prod is 931343 bytes.

I think that there may be a bug somewhere in src/olympia/blocklist/mlbf.py that causes the exclude list to contain duplicates. Since the MLBF size depends directly on the sizes of include and exclude (which are lists, not sets), duplicate entries can result in MLBF sizes that are larger than they should be. In the mlbf.py code, I see suspicious code that suggests that there are indeed duplicates under some circumstances:
https://github.com/mozilla/addons-server/blob/ffd993045cef73fb7635e4a94ac95e1a3a9073ec/src/olympia/blocklist/mlbf.py#L186-L189

As a work-around for this bug, I think we could de-duplicate the include and exclude inputs to the generate_mlbf function, as sketched below. That would resolve the immediate regression; the proper fix is to prevent duplicates from appearing in the inputs in the first place.
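A minimal sketch of that work-around (the call site is hypothetical; dict.fromkeys keeps the first occurrence of each entry and preserves order):

def deduplicate(items):
    # Order-preserving de-duplication: dicts preserve insertion order, so this
    # keeps the first occurrence of every entry and drops later duplicates.
    return list(dict.fromkeys(items))


# Hypothetical call site: de-duplicate the inputs before filter generation so
# duplicate entries cannot inflate the MLBF size.
# cascade = generate_mlbf(stats, deduplicate(include), deduplicate(exclude))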
