[Task]: Improve generation to generate smaller MLBF files #15261

Open
Rob--W opened this issue Dec 25, 2024 · 3 comments
Labels
repository:addons-server Issue relating to addons-server

@Rob--W
Member

Rob--W commented Dec 25, 2024

Description

Recently, a new MLBF was published whose size increased by almost 100 KB in comparison to the previous MLBF. This was flagged as a performance regression in https://bugzilla.mozilla.org/show_bug.cgi?id=1938506.

EDIT: The analysis below was based on the assumption that the MLBF grew due to unexpected changes in the ratio. After checking the data in production, it seems more likely that there is a bug in the generation code that results in duplicate entries, which can also skew the ratio for no good reason. See #15261 (comment)

After taking a closer look, it looks like there is an actionable way towards reducing the size of the MLBF data. The full analysis is in https://bugzilla.mozilla.org/show_bug.cgi?id=1938506#c5.

In summary: to immediately reduce the size of the hard block MLBF in production, https://github.com/mozilla/addons-server/blob/0f718e347cde838085c9f8b2f5eec8fb45f125b4/src/olympia/blocklist/mlbf.py#L55-L57 can be changed as follows:

-        cascade.set_crlite_error_rates(
-            include_len=error_rates[0], exclude_len=error_rates[1]
-        )
+        # See https://bugzilla.mozilla.org/show_bug.cgi?id=1938506#c5
+        cascade.set_error_rates([0.25, 0.5])

To make sure that the improvement is a strict improvement (i.e. that the above change does not negatively affect other bloom filters for different inputs), one way is to run the MLBF generation logic multiple times and pick the smallest of the resulting filters, for example as sketched below.
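For illustration, a rough sketch of that approach (hypothetical helper, not existing addons-server code; build_filter stands in for the generation logic in mlbf.py with the error rates made configurable, and bitCount() is assumed as the size measure):

def smallest_mlbf(build_filter, candidate_error_rates):
    """Build one filter per candidate error-rate configuration and return the
    smallest one. build_filter takes an error_rates list (or None to keep the
    current set_crlite_error_rates behaviour) and returns a FilterCascade."""
    candidates = [build_filter(error_rates) for error_rates in candidate_error_rates]
    # bitCount() is used as the size measure here; the length of the
    # serialized filter file would work just as well as a comparison key.
    return min(candidates, key=lambda cascade: cascade.bitCount())

# Example: compare the current CRLite-derived rates against the fixed
# [0.25, 0.5] rates proposed above, and publish whichever filter is smaller:
# best = smallest_mlbf(build_filter, [None, [0.25, 0.5]])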

Acceptance Criteria

Milestones/checkpoints


Checks

  • If I have identified that the work is specific to a repository, I have removed "repository:addons-server" or "repository:addons-frontend"

┆Issue is synchronized with this Jira Task

@Rob--W Rob--W added needs:info repository:addons-server Issue relating to addons-server labels Dec 25, 2024
@KevinMind
Contributor

KevinMind commented Jan 6, 2025

If I understand your analysis correctly @Rob--W the root cause of this is a decrease in the ratio of hard blocked to total versions which forces the filter to create more and larger hashes to get accurate results. Essentially the haystack is growing and the needle is still a needle. This behavior is not related to changes made in the last few months to the MLBF logic or how we are setting the error rates.

I was looking at how we have changed the logic for setting the error rates and it looks like this has been consistent since at least July 2023 (probably longer) other than the fact that we do not set error rates if either the include or exclude list is empty as this causes errors in filter generation. That should not be relevant here.

Further, it seems like we could try to integrate the formula you are using to compute the new error rate into the code so we don't need to continuously update the constant. I need to read a bit deeper but I would assume if we simply hard coded this value we solve the problem now but it could resurface if the ratio of blocked to total versions changes significantly again.

Can you sense check this for me? Am I missing something obvious so far?

Final point, we have stress testing for the filter and could add filter size testing as well to ensure not only we create the filter in a reasonable time but that it meets a standard for size relative to the underlying data set. Maybe that could also be useful here.
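For example, such a size check could look roughly like this (purely illustrative, not existing code; the synthetic data, the generate_mlbf signature, and the 32-bits-per-blocked-version budget are assumptions that would need calibrating against known-good production filters):

from olympia.blocklist.mlbf import generate_mlbf


def make_fake_versions(blocked, not_blocked):
    # Synthetic "guid:version"-style entries, only meant to exercise the
    # filter size, not to resemble real add-on data.
    include = [f'blocked-{i}:1.0' for i in range(blocked)]
    exclude = [f'ok-{i}:1.0' for i in range(not_blocked)]
    return include, exclude


def test_mlbf_size_stays_reasonable():
    include, exclude = make_fake_versions(blocked=10_000, not_blocked=290_000)
    cascade = generate_mlbf(stats={}, include=include, exclude=exclude)
    # Placeholder budget of 32 bits per blocked version; the real bound would
    # be derived from the sizes of known-good production filters.
    assert cascade.bitCount() <= 32 * len(include)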

@Rob--W
Member Author

Rob--W commented Jan 6, 2025

If I understand your analysis correctly @Rob--W the root cause of this is a decrease in the ratio of hard blocked to total versions which forces the filter to create more and larger hashes to get accurate results.

To definitively confirm this, we need to know the actual number of hard blocks vs non-hard blocks. With that, the generation process can be simulated again to estimate the numbers. Due to the construction, unless the inputs are exactly the same, the result may be slightly off, but I think it is close enough to be meaningful.

Essentially the haystack is growing and the needle is still a needle. This behavior is not related to changes made in the last few months to the MLBF logic or how we are setting the error rates.

Likely independent of those changes, but before discarding this possibility I'd like to verify/confirm the numbers. E.g. if refactoring resulted in too many items being considered "blocked", that could also have skewed the ratio enough to trigger this issue.

I was looking at how we have changed the logic for setting the error rates and it looks like this has been consistent since at least July 2023 (probably longer) other than the fact that we do not set error rates if either the include or exclude list is empty as this causes errors in filter generation. That should not be relevant here.

I traced the history, and the logic has not changed significantly since 2020. The last change/refactor that touched this logic was mozilla/addons-server#14108.

Further, it seems like we could try to integrate the formula you are using to compute the new error rate into the code so we don't need to continuously update the constant. I need to read a bit deeper but I would assume if we simply hard coded this value we solve the problem now but it could resurface if the ratio of blocked to total versions changes significantly again.

I showed multiple formulas in https://bugzilla.mozilla.org/show_bug.cgi?id=1938506#c5, so it is not clear which one you are suggesting to include. The remainder of my comment refers to functions and terminology from that Bugzilla comment, so if anything is unclear, please Ctrl+F in the Bugzilla comment.

set_crlite_error_rates and calc_n_hashes are directly relevant to the size in this case. I also mentioned calc_size without further analysis, but that can also affect the size of the result.

Can you sense check this for me? Am I missing something obvious so far?

The error_rates option (which specifies the falsePositiveRate) can be set directly to avoid having to work around set_crlite_error_rates. If the number of hashes is the primary cause of the size regression, a potential approach is to generate multiple error_rates combinations such that calc_n_hashes returns a suggestion close to the current estimate. E.g. if the current nhashes is 3, we try an error rate such that nhashes would be computed as 2. Or we could simply iterate nhashes from 1 to (e.g.) 3.
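For context, this uses the standard Bloom filter relation between the per-layer false-positive rate p and the number of hash functions, k ≈ log2(1/p). A small sketch of that relation (plain Bloom filter math, not the actual calc_n_hashes implementation):

import math


def hashes_for_error_rate(error_rate):
    # Standard Bloom filter relation: k ≈ log2(1 / p), with at least one hash.
    return max(1, math.ceil(math.log2(1 / error_rate)))


def error_rate_for_hashes(n_hashes):
    # Inverse of the relation above: 2 ** -k is a rate at which k hashes suffice.
    return 2.0 ** -n_hashes


# E.g. to bring a layer from 3 hashes down to 2, an error rate of 0.25 is
# enough, which lines up with the [0.25, 0.5] rates proposed in the description.
assert hashes_for_error_rate(0.25) == 2
assert error_rate_for_hashes(2) == 0.25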

Final point, we have stress testing for the filter and could add filter size testing as well to ensure not only we create the filter in a reasonable time but that it meets a standard for size relative to the underlying data set. Maybe that could also be useful here.

SGTM.

FYI: I have Python scripts to simulate generation and to analyze MLBF file data. Let me know if you are interested in either, and I'll upload them somewhere.

@Rob--W
Member Author

Rob--W commented Jan 11, 2025

I received the actual inputs that were used as part of the generation (cache.json as written by the MLBF loader).

From the public MLBF I extracted the hash, and combined this with the data in cache.json (include = blocked, exclude = not_blocked + soft_blocked) to generate a new MLBF.
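In code terms, the reconstruction described above amounts to roughly the following (the cache.json key names are assumed to match the terms used here):

import json

with open('cache.json') as f:
    cache = json.load(f)

# include/exclude as described above: blocked versions on one side, everything
# else (not_blocked plus soft_blocked) on the other.
include = cache['blocked']
exclude = cache['not_blocked'] + cache['soft_blocked']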

With the 7 Nov 2024 data, my generated MLBF is identical to the one in prod; the MLBF is 845025 bytes.

But with the 17 Dec 2024 data, my generated MLBF is 847859 bytes (which looks very reasonable and is what I expect), while the actual MLBF in prod is 931343 bytes.

I think that there may be a bug somewhere in src/olympia/blocklist/mlbf.py that causes the exclude list to contain duplicates. Since the MLBF size depends directly on the sizes of include and exclude (which are lists, not sets), duplicate entries can result in MLBF sizes that are larger than they should be. In the mlbf.py code, I see suspicious code that suggests that there are indeed duplicates under some circumstances:
https://github.com/mozilla/addons-server/blob/ffd993045cef73fb7635e4a94ac95e1a3a9073ec/src/olympia/blocklist/mlbf.py#L186-L189

As a work-around for this bug, I think we could de-duplicate the include and exclude inputs to the generate_mlbf function, as sketched below. That would resolve the immediate regression; the proper fix is to prevent duplicates from appearing in the inputs in the first place.
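A minimal sketch of that work-around (the call site is hypothetical; dict.fromkeys keeps the first occurrence of each entry and preserves order):

def deduplicate(items):
    # Order-preserving de-duplication: dicts preserve insertion order, so this
    # keeps the first occurrence of every entry and drops later duplicates.
    return list(dict.fromkeys(items))


# Hypothetical call site: de-duplicate the inputs before filter generation so
# duplicate entries cannot inflate the MLBF size.
# cascade = generate_mlbf(stats, deduplicate(include), deduplicate(exclude))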
