[Task]: Improve generation to generate smaller MLBF files #15261
Comments
If I understand your analysis correctly @Rob--W, the root cause of this is a decrease in the ratio of hard-blocked to total versions, which forces the filter to create more and larger hashes to get accurate results. Essentially, the haystack is growing and the needle is still a needle. This behavior is not related to changes made in the last few months to the MLBF logic or to how we set the error rates.

I was looking at how we have changed the logic for setting the error rates, and it has been consistent since at least July 2023 (probably longer), other than the fact that we do not set error rates if either the include or exclude list is empty, since that causes errors in filter generation. That should not be relevant here.

Further, it seems like we could integrate the formula you are using to compute the new error rate into the code so we don't need to continuously update the constant. I need to read a bit deeper, but I would assume that if we simply hard-coded this value we would solve the problem now, though it could resurface if the ratio of blocked to total versions changes significantly again. Can you sense-check this for me? Am I missing something obvious so far?

Final point: we have stress testing for the filter and could add filter-size testing as well, to ensure not only that we create the filter in a reasonable time but that it meets a standard for size relative to the underlying data set. Maybe that could also be useful here.
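For intuition about the "haystack vs. needle" effect, here is a rough back-of-envelope sketch (my addition, not from the thread). It uses the standard single-layer Bloom filter size formula, and the first-layer error rate tied to the include/exclude ratio is an assumption borrowed from CRLite-style cascades; the counts are made up:

```python
import math

def bloom_bits(n_items, fp_rate):
    # Standard Bloom filter size estimate: m = -n * ln(p) / (ln 2)^2
    return -n_items * math.log(fp_rate) / (math.log(2) ** 2)

hard_blocked = 20_000  # the "needle": assumed to stay flat
for total_not_blocked in (1_000_000, 2_000_000, 4_000_000):  # the growing "haystack"
    # Assumed CRLite-style first-layer error rate: it shrinks as the ratio
    # of blocked to not-blocked versions shrinks, so each layer-1 entry
    # needs more bits even though the number of needles is unchanged.
    p0 = min(0.5, (hard_blocked / total_not_blocked) * math.sqrt(0.5))
    kib = bloom_bits(hard_blocked, p0) / 8 / 1024
    print(f'{total_not_blocked:>9} not-blocked -> layer 1 ~{kib:.0f} KiB (p0={p0:.5f})')
```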
To definitively confirm this, we need to know the actual number of hard blocks vs. non-hard blocks. With that, the generation process can be simulated again to estimate the numbers. Due to the construction, unless the inputs are exactly the same, it is possible for the result to be slightly off, but I think it is close enough to be meaningful.

Likely independent of changes, but before discarding this possibility I'd like to verify/confirm the numbers. E.g. if refactoring resulted in too many items being considered "blocked", that could also affect the ratio and trigger the bug.
I traced the history, and the logic has not changed significantly since 2020. The last change/refactor that touched this logic was mozilla/addons-server#14108.
I showed multiple formulas in https://bugzilla.mozilla.org/show_bug.cgi?id=1938506#c5, so it is not clear which one you are suggesting to include. The remainder of my comment here refers to functions and terminology from my Bugzilla comment, so if anything is unclear, please Ctrl+F on Bugzilla.
SGTM. FYI: I have Python scripts to simulate generation and to analyze MLBF (file) data. Let me know if you are interested in either, and I'll upload them somewhere.
I received the actual inputs that were used as part of the generation (cache.json as written by the MLBF loader). From the public MLBF I extracted the hash, and combined this with the data in cache.json (include = blocked, exclude = not_blocked + soft_blocked) to generate a new MLBF.

With the 7 Nov 2024 data, my generated MLBF is identical to the one in prod, with the MLBF consisting of 845025 bytes. But with the 17 Dec 2024 data, my generated MLBF has 847859 bytes (which looks very reasonable and is what I expect), while the actual MLBF has 931343 bytes.

I think that there may be a bug somewhere in src/olympia/blocklist/mlbf.py that causes the inputs to contain duplicate entries. As a work-around to this bug, I think that we could de-duplicate the inputs before generating the filter; see the sketch below.
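A minimal sketch of that de-duplication work-around, assuming the cache.json keys mentioned above (`blocked`, `soft_blocked`, `not_blocked`) hold lists of version hashes; the helper name and sample data here are hypothetical:

```python
def dedupe(items):
    # Drop duplicate entries while preserving first-seen order, so the
    # de-duplicated lists stay comparable to the original cache.json data.
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out

# Sample data standing in for the real cache.json contents.
cache = {
    'blocked': ['hash_a', 'hash_b', 'hash_a'],  # duplicate on purpose
    'soft_blocked': ['hash_c'],
    'not_blocked': ['hash_d', 'hash_d', 'hash_e'],
}

include = dedupe(cache['blocked'])
exclude = dedupe(cache['not_blocked'] + cache['soft_blocked'])
assert include == ['hash_a', 'hash_b']
assert exclude == ['hash_d', 'hash_e', 'hash_c']
```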
Description
Recently, a new MLBF was published whose size increased by almost 100 KB in comparison to the previous MLBF. This was flagged as a performance regression in https://bugzilla.mozilla.org/show_bug.cgi?id=1938506.
EDIT: The analysis below was based on the assumption that the MLBF grew due to unexpected changes in the ratio. After checking the data in production, it seems more likely that there is a bug in the generation code that results in duplicate entries, which can also skew the ratio for no good reason. See #15261 (comment)
After taking a closer look, there appears to be an actionable way to reduce the size of the MLBF data. The full analysis is in https://bugzilla.mozilla.org/show_bug.cgi?id=1938506#c5.
Summarized, to immediately reduce the size of the hard block MLBF in production, https://github.com/mozilla/addons-server/blob/0f718e347cde838085c9f8b2f5eec8fb45f125b4/src/olympia/blocklist/mlbf.py#L55-L57 can be changed as follows:
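As a hedged sketch only (not the actual patch): the idea is to derive the first-layer error rate from the include/exclude ratio instead of relying on a fixed constant. The `derive_error_rates` helper is hypothetical, and the `r * sqrt(0.5)` form is the usual CRLite-style choice, assumed here; the exact formula proposed in bug 1938506 comment 5 may differ:

```python
import math

def derive_error_rates(include_len, exclude_len):
    # Hypothetical helper: compute the cascade's error rates from the
    # current ratio of blocked to not-blocked versions, instead of a
    # hard-coded constant, so the filter keeps a near-optimal size when
    # the ratio drifts. The r * sqrt(0.5) first-layer rate is an
    # assumption borrowed from CRLite-style constructions.
    ratio = include_len / exclude_len
    first_layer = min(0.5, ratio * math.sqrt(0.5))
    return [first_layer, 0.5]  # subsequent layers use 0.5

print(derive_error_rates(20_000, 2_000_000))  # e.g. [0.00707..., 0.5]
```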
To make sure that the change is a strict improvement (i.e. that it does not negatively affect other bloom filters for different inputs), one way is to run the MLBF generation logic multiple times and pick the smallest of the resulting filters, as in the sketch below.
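A sketch of that "generate several, keep the smallest" approach; `generate_mlbf` is a stand-in for the real generation entry point in src/olympia/blocklist/mlbf.py (assumed to return the serialized filter bytes), and the candidate list is illustrative:

```python
def pick_smallest_mlbf(generate_mlbf, include, exclude, candidates, attempts=3):
    # Try each candidate set of error rates and keep the smallest output.
    # If generation is salted/randomized, repeated runs with the same
    # parameters can also differ in size, hence the inner attempts loop.
    best = None
    for error_rates in candidates:
        for _ in range(attempts):
            data = generate_mlbf(include, exclude, error_rates=error_rates)
            if best is None or len(data) < len(best):
                best = data
    return best
```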
Acceptance Criteria
Milestones/checkpoints
Checks
┆Issue is synchronized with this Jira Task