Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make table column branch.name case sensitive #28633

Closed
wants to merge 2 commits into from

Conversation

wxiaoguang
Copy link
Contributor

@wxiaoguang wxiaoguang commented Dec 28, 2023

The bug was introduced by #22743 , because some databases are default to "case-insensitive"

Fix #28131
Fix #25909
and more

This PR is only a quick fix, it only fixes the "branch" case-insensitive bug. It is not a complete fix for the database collation problem.

To fix various potential "case" bugs in the future, it's better to make all database collation case-sensitive (mainly MySQL/MSSQL)

@wxiaoguang wxiaoguang added this to the 1.22.0 milestone Dec 28, 2023
@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Dec 28, 2023
@pull-request-size pull-request-size bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Dec 28, 2023
@wxiaoguang wxiaoguang force-pushed the fix-branch-case branch 2 times, most recently from f3a702b to e4dd341 Compare December 28, 2023 12:07
@pull-request-size pull-request-size bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 28, 2023
@wxiaoguang wxiaoguang force-pushed the fix-branch-case branch 3 times, most recently from d0ebd57 to f361ab4 Compare December 28, 2023 12:30
@wxiaoguang
Copy link
Contributor Author

CI passes, although this is only a quick/partial fix.

@GiteaBot GiteaBot added lgtm/need 1 This PR needs approval from one additional maintainer to be merged. and removed lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. labels Dec 28, 2023
@JakobDev
Copy link
Contributor

To fix various potential "case" bugs in the future, it's better to make all database collation case-sensitive (mainly MySQL/MSSQL)

I think XORM should do this automatically

@GiteaBot GiteaBot added lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. and removed lgtm/need 1 This PR needs approval from one additional maintainer to be merged. labels Dec 29, 2023
@algernon
Copy link
Contributor

algernon commented Dec 29, 2023

FWIW, branch names aren't the only thing that break with case insensitivity. /api/v1/repo/issues/search?labels=foo will happily return issues labeled "Foo" when case insensitivity is in play, while it would normally only return issues labeled "foo". It's not as spectacular a bug as the branch names, but a bug nevertheless.

And there are a fair number of other, similar issues present.

You may also want to fix gitea doctor convert while there, because - as far as I see, but correct me if I'm wrong - if the migration that changes the branch table runs first, and then one runs gitea doctor convert (by mistake, or as part of an upgrade process, thinking it is idempotent like migrations), that will reset the collation back to a case insensitive one. At that point, running migrations again won't help, because the database version will be the latest, so the database will be left in a broken state, and will require manual fixing.

@lunny
Copy link
Member

lunny commented Dec 29, 2023

Maybe we should also change the COLLATE from utf8mb4_general_ci to utf8mb4_bin on _, err := x.Exec(fmt.Sprintf("ALTER DATABASE %s CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci"?

@wxiaoguang
Copy link
Contributor Author

@algernon Thank you for the information, it's thoughtful. My opinions for these problems are:

FWIW, branch names aren't the only thing that break with case insensitivity. /api/v1/repo/issues/search?labels=foo will happily return issues labeled "Foo" when case insensitivity is in play, while it would normally only return issues labeled "foo". It's not as spectacular a bug as the branch names, but a bug nevertheless.

And there are a fair number of other, similar issues present.

It's hard to say it is right or wrong. For example, some people depends on the case-insensitive behavior to do database LIKE %...% search. For those people, using utf8mb4_bin is breaking.

I would try to propose a complete PR later, to leave a chance to end users, while make the behavior consistent by default for most users.

You may also want to fix gitea doctor convert while there, because - as far as I see, but correct me if I'm wrong - if the migration that changes the branch table runs first, and then one runs gitea doctor convert (by mistake, or as part of an upgrade process, thinking it is idempotent like migrations), that will reset the collation back to a case insensitive one. At that point, running migrations again won't help, because the database version will be the latest, so the database will be left in a broken state, and will require manual fixing.

You are right, running gitea doctor convert later would make the collation become utf8mb4_general_ci again. For such case:

  1. If the user already has case-sensitive branch names in the table, the gitea doctor convert command would fail, so nothing changes.
  2. If the user doesn't have case-sensitive branch names, maybe they is lucky enough, nearly impossible to be affected by this problem soon, and then the complete fix would have been ready.

This PR's purpose is just to make a partial/quick fix for the branch name case conflicting, including MSSQL, so the migration has the same behavior at the moment.

There are more TODOs IMO.

To address the concern "doctor convert by mistake", I think this PR could do more, eg: skip the tables which are already "utf8mb4"

@lunny lunny added the reviewed/wait-merge This pull request is part of the merge queue. It will be merged soon. label Dec 29, 2023
@algernon
Copy link
Contributor

@algernon Thank you for the information, it's thoughtful. My opinions for these problems are:

FWIW, branch names aren't the only thing that break with case insensitivity. /api/v1/repo/issues/search?labels=foo will happily return issues labeled "Foo" when case insensitivity is in play, while it would normally only return issues labeled "foo". It's not as spectacular a bug as the branch names, but a bug nevertheless.
And there are a fair number of other, similar issues present.

It's hard to say it is right or wrong. For example, some people depends on the case-insensitive behavior to do database LIKE %...% search. For those people, using utf8mb4_bin is breaking.

But it works differently when using another DB engine, so it's a bug. Either because it is case sensitive in other engines, or because it is case insensitive with MySQL. Inconsistency between engines is a bug, I hope we can agree on that. It needs to be the same on all engines. So either it needs to use something like db.BuildCaseInsensitiveLike, or MySQL needs to use a case sensitive collation for that column.

If you want case insensitive behaviour, that should be explicit, like it's done for usernames, and in a couple of other places. Gitea has helper functions like db.BuildCaseInsensitiveLike and db.BuildCaseInsensitiveIn to handle case insensitivity, which, to me, strongly suggests that the codebase assumes that the entire database is otherwise case sensitive. Having found numerous other instances of this very problem pretty much confirms that's the case.

So the real fix here is to make sure that the database is case sensitive, once and for all. For cases where case insensitivity is desired, that should be made explicit, just like it is done for usernames, or issue search (where searching by title or content are explicitly case insensitive, unlike searching by label name, which is case sensitive [but that depends on the database engine, which I assert is a bug]).

I would try to propose a complete PR later, to leave a chance to end users, while make the behavior consistent by default for most users.

Again, the problem is that the behaviour is different between DB engines. If you "leave a chance to end users", that inconsistency will remain. To fix this problem, you need to get rid of the inconsistency in the first place, either by making every engine case sensitive, or making them all case insensitive. Anything else is just magical handwaving that essentially does nothing to address the root cause. You'll just be playing cat and mouse, trying to catch all the instances of the same root cause.

You may also want to fix gitea doctor convert while there, because - as far as I see, but correct me if I'm wrong - if the migration that changes the branch table runs first, and then one runs gitea doctor convert (by mistake, or as part of an upgrade process, thinking it is idempotent like migrations), that will reset the collation back to a case insensitive one. At that point, running migrations again won't help, because the database version will be the latest, so the database will be left in a broken state, and will require manual fixing.

You are right, running gitea doctor convert later would make the collation become utf8mb4_general_ci again. For such case:

1. If the user already has case-sensitive branch names in the table, the `gitea doctor convert` command would fail, so nothing changes.

Consider this scenario: the instance has no case sensitive branch names, and the database is using a case insensitive collation. You run the migration, so it gets converted to case sensitive. Later on, the user runs gitea doctor convert, which sets the collation back to case insensitive (remember: t here are no case sensitive branches!). Then later the user tries to create a case sensitive branch, and Gitea blows up. Running the migration again will do nothing, because the database is already at the latest version.

Thus, gitea doctor convert broke the database.

2. If the user doesn't have case-sensitive branch names, maybe they is lucky enough, nearly impossible to be affected by this problem soon, and then the complete fix would have been ready.

Or, you could just do a complete fix that addresses the root cause (case insensitive collation), rather than playing cat and mouse trying to fix instances of the same problem. Like I said, it's not only about branch names. It affects issue labels, and a whole lot more. Fixing it on the column or table level also leaves you open to the same issue in the future, as soon as someone forgets to guard case sensitivity on a new table or column.

Adjusting gitea doctor convert to also set the proper collation fixes the root cause, without having to guard every single table and column (past, present and future) that may be affected. It also makes Gitea behave the same regardless of what DB engine is used.

Yes, it does not fix MSSQL (but patching each instance of the problem doesn't, either, only temporarily, at great cost). I suspect there's a reasonable fix there too, but I can't test that, so someone else will have to figure that part out.

To address the concern "doctor convert by mistake", I think this PR could do more, eg: skip the tables which are already "utf8mb4"

Yeah, that'd make it so it doesn't break the database. But if the collation would be changed in doctor convert, then it not only wouldn't break the database, it would fix the case insensitivity problem too (at least for MySQL), at the root cause.

@wxiaoguang
Copy link
Contributor Author

Adjusting gitea doctor convert to also set the proper collation fixes the root cause, without having to guard every single table and column (past, present and future) that may be affected. It also makes Gitea behave the same regardless of what DB engine is used.

That's also one of the root problem. TBH, utf8mb4_bin is not a general useful collation for texts. IIRC it affects case-casting, ordering/sorting. But MySQL and MariaDB doesn't provide consistent collations (utf8mb4_0900_as_cs vs uca1400_as_cs), that's why I haven't made up my mind to make a final decision.

@algernon
Copy link
Contributor

Maybe we should also change the COLLATE from utf8mb4_general_ci to utf8mb4_bin on _, err := x.Exec(fmt.Sprintf("ALTER DATABASE %s CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci"?

Yes. That (and changing the ALTER TABLE in a similar way a few lines later) would actually fix the root cause, and the other changes - including the migration - would be unnecessary.

@delvh delvh removed the reviewed/wait-merge This pull request is part of the merge queue. It will be merged soon. label Dec 29, 2023
@wxiaoguang
Copy link
Contributor Author

Maybe we should also change the COLLATE from utf8mb4_general_ci to utf8mb4_bin on _, err := x.Exec(fmt.Sprintf("ALTER DATABASE %s CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci"?

Yes. That (and changing the ALTER TABLE in a similar way a few lines later) would actually fix the root cause, and the other changes - including the migration - would be unnecessary.

Due to utf8mb4_bin is not a generally useful collation (#28633 (comment)), that's what I meant "to leave a chance to end users"

@algernon
Copy link
Contributor

Adjusting gitea doctor convert to also set the proper collation fixes the root cause, without having to guard every single table and column (past, present and future) that may be affected. It also makes Gitea behave the same regardless of what DB engine is used.

That's also one of the root problem. TBH, utf8mb4_bin is not a general useful collation for texts. IIRC it affects case-casting, ordering/sorting. But MySQL and MariaDB doesn't provide consistent collations (utf8mb4_0900_as_cs vs uca1400_as_cs), that's why I haven't made up my mind to make a final decision.

It isn't perfect, indeed, but it is better than utf8mb4_general_ci. As for MySQL/MariaDB: it should be possible to try both. Try changing the database collation to utf8mb4_0900_as_cs, if that fails, try changing it to uca1400_as_cs, if that fails, fall back to utf8mb4_bin. Whichever worked, use that for the ALTER TABLE calls later.

Changing the database collation is cheap, because it doesn't convert any tables, it just sets a default, pretty much, for future tables. So we can use this to figure out which collation is supported. That, or there's SHOW COLLATION WHERE collation = 'uca1400_as_cs', we can find the supported collation that way, too.

@wxiaoguang
Copy link
Contributor Author

wxiaoguang commented Dec 29, 2023

OK, I didn't post all my plans because I haven't got a full picture. Since we already have a deep discussion (thank you very much), I would like to write all what I thought:

My basic rules:

  1. Fix problems quickly, and make it improvable for future.
  2. Avoid breaking end users as much as possible.

So this PR does point 1: only fix the problem temporarily, and then when we have a complete fix, this migration (and TableCollation) could be removed.

Then about the complete fix in my mind:

It should never report fatal errors to end users. It should warn users that "your database collation seems wrong, please do ....."

The detailed steps are:

  1. Add an config option, eg: [database].CHARSET_COLLATION
    • It defaults to empty, means use default collation.
    • Default for MySQL means "utf8mb4_0900_as_cs"
    • Default for MariaDB means "uca1400_as_cs"
    • Default for MSSQL means "Latin1_General_CS_AS"
  2. Add a startup check, check the collations database and tables
    • If there are mismatched collations, show a startup warning, and show an admin warning
  3. Refactor gitea doctor convert, to use the [database].CHARSET_COLLATION to convert the tables.
  4. Remove this migration, because after these steps above, this migration is not needed anymore.
  5. Add some code to help to initialize docker database.

I think (just a brief idea) by this approach, everything should be fine eventually.

@algernon
Copy link
Contributor

I like the proper fix plan for the most part. I don't think we can easily distinguish MySQL from MariaDB from the settings alone, so setting a different default for them is going to be problematic. This is one of the reasons - and user convenience - I suggested auto-detecting it. Having a [database].DEFAULT_CHARSET and [database].DEFAULT_COLLATION that override the auto-detection does make sense, however.

I already have most of this written, by the way (I started working on the fix yesterday), including plenty of test cases, and I can adjust it to have configurable default charset / collation. If you're interested, I can submit my work in a couple of hours, once I finish it up.

@wxiaoguang
Copy link
Contributor Author

wxiaoguang commented Dec 29, 2023

I already have most of this written, by the way (I started working on the fix yesterday), including plenty of test cases, and I can adjust it to have configurable default charset / collation. If you're interested, I can submit my work in a couple of hours, once I finish it up.

I have read your PR, I am not sure whether it could implement that plan. And it lacks MSSQL support.

Another thing is that I think the tests don't need to be that complex. As long as we can guarantee the database tables are using expected collation, I think "model" related tests could be removed, or just simplify them and put them into their own(related) test file.

TBH, the code in my mind might be quite different from your PR 🤣 In history, some F* guys submitted buggy PRs and I made some changes, then they wrote long posts to criticize me and Gitea. I am really afraid of co-working with other PRs now.

@algernon
Copy link
Contributor

I have read your PR, I am not sure whether it could implement that plan. And it lacks MSSQL support.

I have not pushed my latest changes yet, because it's a work in progress, and there's still a few things to adjust. Locally, I have:

  • setting.Database extended with DefaultCharset and DefaultCollation
  • gitea doctor convert working as you described (+ auto detection): if looks at the settings, and will use appropriate defaults to set the charset & collation for the db and the tables.
  • a startup check that warns (but keeps starting up) if the charset or collation is not the one configured (or auto-detected). It currently only looks at the database charset & collation, but I'll make it iterate over the tables too, soon.

For docker, running gitea doctor convert will Do The Right Thing. Alternatively, we can add startup code that if Gitea sees an empty database, it will set the appropriate collation and charset, and then create the tables. That will solve the problem for new dockerized mariadbs. Existing ones - as any other existing db - will need a gitea doctor convert.

Another thing is that I think the tests don't need to be that complex. As long as we can guarantee the database tables are using expected collation, I think "model" related tests could be removed, or just simplify them and put them into their own(related) test file.

I like tests that exercise slightly different problems: for example, the branch name test (unpushed atm) tests that with case insensitivity, you can't create branches that differ only in case, while the issue search test tests that you can have unexpected results returned. Same underlying problem, different symptom, so I think testing both is worth the (minimal, really) effort.

I am really afraid of co-working with other PRs now.

Your call, really. I'll submit my PR anyway, and you can judge whether its worth using it. It just feels like a waste of time to do it again, when I already did most of it.

@wxiaoguang
Copy link
Contributor Author

Thank you. I will close this one and spend more time for a complex fix.

Your call, really.

Do you mean I can co-work and edit your PR?

@wxiaoguang wxiaoguang closed this Dec 29, 2023
@algernon
Copy link
Contributor

Do you mean I can co-work and edit your PR?

Yes. Just give me an hour or two to push my WIP, and you can have a go at it, and edit it at your leisure.

@wxiaoguang
Copy link
Contributor Author

wxiaoguang commented Dec 29, 2023

No hurry, I think for existing users they could manually bypass the bug (run the ALTER sql, use utf8mb4_bin, if they really need it ..... added the workaround to #28131)

There is enough time before 1.22 release.

@GiteaBot GiteaBot removed this from the 1.22.0 milestone Dec 30, 2023
@lunny lunny mentioned this pull request Dec 30, 2023
@wxiaoguang wxiaoguang deleted the fix-branch-case branch February 8, 2024 02:01
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 2, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. modifies/docs modifies/migrations size/L Denotes a PR that changes 100-499 lines, ignoring generated files. type/bug
Projects
None yet
7 participants