Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

User accounts deduping #3246

Open
AbdBarho opened this issue May 28, 2023 · 10 comments
Open

User accounts deduping #3246

AbdBarho opened this issue May 28, 2023 · 10 comments
Assignees

Comments

@AbdBarho
Copy link
Collaborator

AbdBarho commented May 28, 2023

Context

There are some users who are facing problem when logged in through multiple methods, i.e. email first and then google / discord second, or the other way around.
#3136 #3232 #2637 #1695 #1268 and probably many more. We don't want their efforts to go to waste.

In the frontend db we have users and accounts, in theory, it should be possible for a user to have multiple accounts from different providers and still be the same user.

However, the way we identify users in the data backend is a combination of user id and auth method, which makes it harder to reconnect the two.

This problem is more pronounced when taking the discord bot into consideration, since it has no information about the user other than their discord id.

For chat, we only use the user's frontend db id, which should make it easier.

Solution Idea

Ideally, we should be able to identify a user through our entire stack using a single piece of information, which is identical through all auth methods we provide.

I see the email address being the only valid option (users might change their emails), which is used by next-auth for connecting multiple accounts to the same user.

This would require at least:

  • Update user auth frontend <--> data backend
  • Change auth headers sent
  • Merge potential duplicate users using their email addresses while preserving their contributions & privileges
  • Update tables
  • Update user auth frontend <--> inference

This will probably be a gradual process where we would have to support both old and new formats while we do the migration.

Open questions

  • What should this unique identifier be
    • email?
    • id of user in frontend db (since it is the entrypoint for all user interactions)
    • Or maybe a separate "user" service?
  • Are there some solutions out there that already solve this problem?

@andreaskoepf @yk @notmd @olliestanley would love to hear your opinions on this, since it is a huge topic and basically touches everything.

@notmd
Copy link
Collaborator

notmd commented May 28, 2023

Have you tested what is the next auth behavior if we remove the unique index from the email column?

@AbdBarho
Copy link
Collaborator Author

Have you tested what is the next auth behavior if we remove the unique index from the email column?

I can test it locally, but that would not help much. That being said, I think leaving it unique is the right choice, we don't want multiple users with the same email.

@notmd
Copy link
Collaborator

notmd commented May 28, 2023

When the user login Discord. If they change their discord email and login again, next-auth still matches to old account and doesn't update the email.

@AbdBarho
Copy link
Collaborator Author

AbdBarho commented Jun 9, 2023

Discussion from meeting 09.06.2023:

The main point should be to use the same ID across the entire stack.
We should stick to the ID from the data backend if possible.

Merging multiple accounts in the backend should be possible, if we know the corresponding frontend ids.

I will try to analyze the databases and see how much overlap do we have and how easy would it be to migrate.

@notmd
Copy link
Collaborator

notmd commented Jun 10, 2023

We should stick to the ID from the data backend if possible.

Sorry I couldn't join the meeting last night, but may I ask why we decided to stick with the ID from the backend? Especially we are wanting to decouple the frontend and the backend to allow deploy the chat app only.

@AbdBarho
Copy link
Collaborator Author

AbdBarho commented Jun 10, 2023

because the ID in the backend is the hardest to change, since it is connected to everything else.

"if possible" because if that does not work, we can still use frontend id.

I imagine using the frontend id being the easiest, and then we could just take the frontend db and make it to its own user service if we want to.

@notmd
Copy link
Collaborator

notmd commented Jun 10, 2023

Then I would suggest the backend store the frontend id, then identify the user via a header as we would do now. It still uses the old id for internal usage

@AbdBarho
Copy link
Collaborator Author

I think like you said we can start by adding the frontend user id.

in the mean time, I will try to see how we can get rid of duplicate accounts.

I imagine we add an endpoint into the backend, and do a post request from frontend with all of the accounts & provider that the user has, and make sure that all of these accounts get merged in data backend, and the frontend user id is also saved somewhere.

Hopefully it is easily doable

@AbdBarho
Copy link
Collaborator Author

Here is a dump of all the stuff I found out in the last 2 day:

Context

we have roughly 6000 users (and growing) who have more than one account and are affect by the duplication.

However, all of these users, regardless of how many providers or accounts of providers they used for login, they have a maximum of 2 accounts in the backend.

This is because we always use the first provider

return {
id: user.accounts[0].providerAccountId,
display_name: user.name,
auth_method: user.accounts[0].provider as AuthMethod,
};

This also means, that users who logged in with email first, and then discord, the first account in the data backend that references the email is basically never used again, and the data is orphaned.

Merging

The backend uses the its own user id as foreign key in almost all tables.

The "easy" way to merge could be to simply update the id of all users to the same id, but this won't work because we don't have cascade updates on the user id to other tables (easy to solve), and would cause duplicate ids in the user table which is unique index (hard to solve).

The "hard" way would be to do this manually, with a rough pseudo code:

users = # get all duplicates
id = users[0].id
for user in users[1:]:
   for table in [messages, tasks, troll_stats, ...]:
      table.where(table.user_id == id).set(id=id)
   userstable.delete(user)

And it seems that this is the only way, but I might be wrong.

API

The user identification over the api boundary uses the auth method and provider id

"X-OASST-USER": `${user.auth_method}:${user.id}`,

Possible values:

auth_method id
"discord" discord user id
"google" google user id
"local" user id in frontend db

This means that even if we merge the users, the backend would still not now that the users are merged, and would create new ones.

We need a new policy of communicating with the backend, ideally we use a single identifier for cross boundary communication, two options:

  • backend user id
    • probably the easiest, the update to any endpoints should be minimal
    • we could also save the backend user id into the frontend db to make our life easier.
  • frontend user id, this would require:
    • add frontend id column to user table in data backend
    • push all frontend user id to the data backend (hard and fragile)
    • change user lookup in data backend to use frontend id

Gradual Migration?

Maybe it would be easy if we can start by solving the problem for new users, preventing the duplication from happening in the first place.

Saving the backend id in the frontend db seems to be the easiest way, if we find it the database, we use it, otherwise fallback to the old method for old users.

Next steps

I will get some sleep.

Getting ideas and suggestions from the team would greatly appreciated.

@AbdBarho AbdBarho self-assigned this Jun 11, 2023
andreaskoepf added a commit that referenced this issue Jun 14, 2023
First backend support for #3246 : Adds a new `/api/v1/admin/merge_users`
endpoint to merge one or more source user accounts into an existing
destination user account. The source user accounts are deleted in the
process and all objects that belonged to the source users are
transferred to the destination user. Rows belonging to the source
accounts in the `user_stats` and `troll_stats` tables are also deleted.
@andreaskoepf
Copy link
Collaborator

@AbdBarho Merging backend users is now possible via POST to /api/v1/admin/merge_users, see #3475.

For the "X-OASST-USER" header I suggest that we use uuid:<backend_user_uuid>, e.g. use "uuid" as auth-method. If that is feasible for the frontend I could update the backend to accept this fromat for existing (just doing lookup via uuid) + new users (actually storing the "uuid" as auth_method for new users). I have to go through the API calls that pass a user objects (which contains the display_name) and find those which indirectly create a new user account.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: In Progress
Development

No branches or pull requests

3 participants