types: convert to new charset before inserting to blob or json column #31031

Merged
merged 33 commits into from
Dec 30, 2021

tangenta
Contributor

What problem does this PR solve?

Issue Number: close #30690

Problem Summary:

set names gbk;
drop table if exists t;
create table t (b blob, d json);
insert into t values ('你好', '{"测试": "你好"}');
select * from t;

The expected insertion procedure for TiDB is:

  1. '你好' is decoded as GBK (because of @@character_set_client), resulting in a UTF-8 string datum '浣犲ソ' annotated with the collation "gbk_bin" (because of @@character_set_connection).
  2. Before the value is inserted into a binary type such as blob or json, it must be converted from UTF-8 back to GBK, so '浣犲ソ' is changed back to '你好'.

Previously, step 2 was skipped for the blob and json types; this PR fixes the issue.
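
The two steps above can be reproduced outside TiDB with Python's built-in codecs. This is only a sketch, not TiDB's code: it assumes the client terminal sends the literal as UTF-8 bytes, and all variable names are illustrative.

```python
# Assumption: the client terminal encodes the literal as UTF-8 before sending it.
client_bytes = "你好".encode("utf-8")

# Step 1: with `set names gbk`, the server decodes the incoming bytes as GBK.
# Internally the value is now the string '浣犲ソ'.
decoded = client_bytes.decode("gbk")

# Step 2: before writing to a binary column, encode the string back to GBK.
stored = decoded.encode("gbk")

# If step 2 is skipped, the internal UTF-8 form is stored instead.
stored_without_step2 = decoded.encode("utf-8")

print(decoded)                               # 浣犲ソ
print(stored == client_bytes)                # True
print(stored_without_step2 == client_bytes)  # False
```

With step 2 applied, the stored bytes are identical to what the client sent, which is why `select * from t` shows '你好' again.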

What is changed and how it works?

This PR also moves string encoding/decoding/validation to Datum.ConvertTo().

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

@ti-chi-bot
Member

ti-chi-bot commented Dec 26, 2021

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • Defined2014
  • xiongjiwei

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 26, 2021
@ti-chi-bot ti-chi-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 27, 2021
@dianqihanwangzi
Contributor

/run-check_dev_2

@@ -5,7 +5,7 @@ insert into t values ('中文', 'asdf', '字符集');
 insert into t values ('À', 'ø', '😂');
 Error 1366: Incorrect string value '\xC3\x80' for column 'a'
 insert into t values ('中文À中文', 'asdføfdsa', '字符集😂字符集');
-Error 1366: Incorrect string value '\xC3\x80\xE4\xB8\xAD\xE6...' for column 'a'
+Error 1366: Incorrect string value '\xC3\x80' for column 'a'
Contributor

This is not the same as MySQL.

Contributor Author

I think it is OK to have this difference?

Contributor

Em... Not sure. Seems we had a PR to fix it before. #25087

Contributor Author

This PR does not introduce the garbled text described in #25087. It extracts the error message from ErrInvalidCharacterSet (which should be well formatted).

The only difference is the number of invalid bytes displayed.
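
To make the difference concrete, the two message shapes discussed in this thread can be reproduced with a short sketch. This is not TiDB's or MySQL's formatting code: the helper names are hypothetical, and the 6-byte limit is only inferred from the longer of the two messages.

```python
def hex_escape(data: bytes) -> str:
    """Render every byte as \\xNN, the style used in the error messages."""
    return "".join(f"\\x{b:02X}" for b in data)


def first_char_only(text: str) -> str:
    """Show only the bytes of the first (offending) character."""
    return hex_escape(text[0].encode("utf-8"))


def truncated_tail(text: str, limit: int = 6) -> str:
    """Show up to `limit` bytes starting at the offending character."""
    data = text.encode("utf-8")
    return hex_escape(data[:limit]) + ("..." if len(data) > limit else "")


# The inserted value '中文À中文' from the invalid character onwards.
rest = "\u00C0中文"

print(first_char_only(rest))  # \xC3\x80
print(truncated_tail(rest))   # \xC3\x80\xE4\xB8\xAD\xE6...
```

Both forms start at the same offending byte sequence; they differ only in how many of the following bytes are echoed back.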

@tangenta
Contributor Author

testColumnTypeChangeSuite.TestChangeFromBitToStringInvalidUtf8ErrMsg fails, but I tried the same statements on MySQL 5.7.36 and 8.0.27, and the result differs from what the test expects:

mysql> create table t (a bit(45));
Query OK, 0 rows affected (0.02 sec)

mysql> insert into t values (117471723421);
Query OK, 1 row affected (0.01 sec)

mysql> alter table t modify column a varchar(31) collate utf8mb4_general_ci;
Query OK, 1 row affected (0.05 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> select * from t;
+--------------+
| a            |
+--------------+
| 117471723421 |
+--------------+
1 row in set (0.03 sec)

We can see that even when the target is a non-binary string, the bit value is converted to an integer first rather than to varchar directly. This conflicts with #25037 (comment):

2: when casting the bit to binary, we should consider converting it to uint and then casting the uint to string, rather than converting the bit to string directly.
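
The two conversion paths under discussion can be compared with a small sketch. It is an illustration only, not TiDB's implementation: it assumes the bit(45) value is held as 6 big-endian bytes, and the variable names are made up for the example.

```python
value = 117471723421  # the value inserted into the bit(45) column

# Assumption: 45 bits are held in 6 bytes, most significant byte first.
raw = value.to_bytes(6, "big")

# Path A: treat the raw bit bytes directly as a utf8mb4 string.
try:
    as_text = raw.decode("utf-8")
except UnicodeDecodeError:
    as_text = None  # the raw bytes are not valid UTF-8

# Path B: read the bits as an unsigned integer, then format it as decimal text.
via_uint = str(int.from_bytes(raw, "big"))

print(as_text)   # None
print(via_uint)  # 117471723421
```

Path B yields the same text as the MySQL session above, while path A fails on an invalid UTF-8 sequence, which matches the kind of error the failing test name refers to.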

@ti-chi-bot ti-chi-bot removed the status/can-merge Indicates a PR has been approved by a committer. label Dec 30, 2021
@tangenta
Contributor Author

/merge

@ti-chi-bot
Member

This pull request has been accepted and is ready to merge.

Commit hash: 0c00650

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Dec 30, 2021
@tangenta
Contributor Author

/run-unit-test

[2021-12-30T07:50:53.565Z] FAIL: client_test.go:104: testRestoreClientSuite.TestIsOnline
[2021-12-30T07:50:53.565Z] 
[2021-12-30T07:50:53.565Z] client_test.go:105:
[2021-12-30T07:50:53.565Z]     c.Assert(s.mock.Start(), IsNil)
[2021-12-30T07:50:53.565Z] ... value *errors.withStack = listen tcp 0.0.0.0:41402: bind: address already in use ("listen tcp 0.0.0.0:41402: bind: address already in use")

@zimulala
Contributor

/run-unit-test


[2021-12-30T08:38:41.665Z] FAIL: db_test.go:40: testRestoreSchemaSuite.SetUpSuite
[2021-12-30T08:38:41.665Z] 
[2021-12-30T08:38:41.665Z] db_test.go:47:
[2021-12-30T08:38:41.665Z]     c.Assert(s.mock.Start(), IsNil)
[2021-12-30T08:38:41.665Z] ... value *errors.withStack = listen tcp 0.0.0.0:36262: bind: address already in use ("listen tcp 0.0.0.0:36262: bind: address already in use")
[2021-12-30T08:38:41.665Z] 

@ti-chi-bot ti-chi-bot removed the status/can-merge Indicates a PR has been approved by a committer. label Dec 30, 2021
@tangenta
Contributor Author

/merge

@ti-chi-bot
Member

This pull request has been accepted and is ready to merge.

Commit hash: d96bcde

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Dec 30, 2021
@ti-chi-bot
Member

@tangenta: Your PR was out of date, I have automatically updated it for you.

At the same time I will also trigger all tests for you:

/run-all-tests

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@tangenta
Contributor Author

/run-unit-test because it timed out.
/run-check_dev_2 because a goroutine leaked.

@tangenta
Contributor Author

/run-unit-test

Labels
release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
Development

Successfully merging this pull request may close these issues.

The inserted result is different with blob type in gbk dml test
7 participants