-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add feature flag to enable legacy md5 hash for anonymous user id #30832
feat: add feature flag to enable legacy md5 hash for anonymous user id #30832
Conversation
Thanks for the pull request, @kaustavb12! Please note that it may take us up to several weeks or months to complete a review and merge your PR. Feel free to add as much of the following information to the ticket as you can:
All technical communication about the code itself will be done via the GitHub pull request interface. As a reminder, our process documentation is here. Please let us know once your PR is ready for our review and all tests are green. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
- I tested this: verified that new hashes are generated with MD5 when the feature flag is enabled
- I read through the code
- I checked for accessibility issues: n/a
- Includes documentation
- I made sure any change in configuration variables is reflected in the corresponding client's
configuration-secure
repository: n/a
@natabene, this is ready for your review. |
@kaustavb12 Thank you for your contribution. |
ba802b7
to
b1b41bf
Compare
…to md5 The hashing algorithm has been changed in cd60646. However, there are Open edX operators who maintain backward compatibility of anonymous user IDs after past rotations of their Django secret key. For them, altering the hashing algorithm was a breaking change that made their analytics inconsistent.
b1b41bf
to
746e4fe
Compare
@kaustavb12 @Agrendalath I don't really understand the need for this change. As I understand, all past anonymous IDs are stored in a database table and won't be changed, even though new ones use a new algorithm. So why can't the operator here load the IDs from the database table for their analytics instead of depending on the details of the hash algorithm? |
@bradenmacdonald, this operator is using a custom analytics system that is integrated with multiple external services. As far as I know, some of these services were using direct user IDs in the past. Therefore, the operator calculates MD5 hashes directly in their database and stores them to maintain the integrity of all existing data. There was a key rotation in the past that changed the anonymous user IDs, so they also have lots of legacy code to make these IDs backward compatible with all possible versions. Hashes are calculated in the database, but database engines are not adapting new hashing functions too fast. For example, SHA-2 was accepted in 2002 (FIPS PUB 180-2), but it appeared in MySQL in 2010 (v5.5.5), and PostgreSQL didn't fully support its variants until 2018. As SHA-3 (FIPS PUB 202) was published in August 2015, it's not likely to see it fully supported soon. Therefore, it would be necessary to pre-calculate these hashes, instead of preparing them directly in the DB query. We don't have details about how exactly these analytics work, though, as we're not managing them, nor even have any access to the code. |
@Agrendalath Thanks, though I am still still having trouble understanding why there's any need to calculate hashes within a DMBS at all. Can't there just be a table in every database/analytics system that needs one, which contains the map of real_id, anyonmous_id for all anonymous ids that have ever been used? i.e. use JOIN/lookups instead of hashing/calculations. In other words, the root problem here is that the external systems were integrated in a way that was too highly coupled to the internal details of how Open edX calculates its anonymous IDs. So it seems like the right fix is to un-couple them from that, rather than add complexity to Open edX by supporting two hash mechanisms. However, I won't block the PR on it - I see it's only a couple lines of code in the end. |
# .. toggle_description: Whether to enable the legacy MD5 hashing algorithm to generate anonymous user id | ||
# instead of the newer SHAKE128 hashing algorithm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to give a brief explanation here of why that might be desirable for someone.
# instead of the newer SHAKE128 hashing algorithm | ||
# .. toggle_use_cases: open_edx | ||
# .. toggle_creation_date: 2022-08-08 | ||
# .. toggle_target_removal_date: None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removal date can be left off for non-temporary toggles.
I share Braden's concerns -- this seems like a really unusual use-case, and I'd want to have a clearer picture of why this can't be solved with table lookups or a sync job before I'd be comfortable adding a toggle that enables an old hash algorithm. (It also sounds like the analytics system might need to have a copy of all current and former |
Thanks @timmc-edx. We are discussing this change internally as well as with our client to see if the approach can be modified based on suggestions from you and @bradenmacdonald. I'll keep you posted on the same. |
Based on your and @bradenmacdonald's suggestions, we had a discussion with our client on the approach to follow here. And it was decided that it would be best for the analytics system to pull the required data from DB rather than calculating the hashes using the Therefore we are closing this PR. Thanks a lot to you and @bradenmacdonald for your reviews and suggestions. cc. @Agrendalath |
@kaustavb12 Even though your pull request wasn’t merged, please take a moment to answer a two question survey so we can improve your experience in the future. |
Description
PR #26198 changed the hash algorithm to generate an anonymous user id from MD5 to SHAKE-128.
However, there are Open edX operators who maintain backward compatibility of anonymous user IDs after past rotations of their Django secret key. For them, altering the hashing algorithm was a breaking change that made their analytics inconsistent.
This PR, introduces a feature flag, to enable Open edX operators to switch back to the legacy MD5 algorithm if they so require.
Supporting information
Link to other information about the change, such as Jira issues, GitHub issues, or Discourse discussions.
Be sure to check they are publicly readable, or if not, repeat the information here.
Testing instructions
/edx/etc/lms.yml
Features
add the new FlagDeadline
"None" if there's no rush, or provide a specific date or event (and reason) if there is one.
Other information
Private Ref: BB-6533