[#1127] YSQL: Collation Support (part 5)

Summary: This diff addresses a number of known issues with recently added YSQL collation support and also enable YSQL collation support by default. 1. Disallow alter column collation unless the old and new collation match exactly. For example: ``` create table foo(id text); CREATE TABLE insert into foo values ('aaaa'); INSERT 0 1 alter table foo alter column id set data type text collate "en_US.utf8"; ysqlsh:altc1.sql:5: ERROR: This ALTER TABLE command is not yet supported. ``` In the example, column 'id' has default collation, but the alter statement tries to change it to "en_US.utf8" which is different. An alter text column command can succeed only when collations do not change. For example: one can change column type from varchar(10) to varchar(20): ``` create table foo(id varchar(10) collate "en_US.utf8"); CREATE TABLE create index foo_id_idx on foo(id collate "C" desc); CREATE INDEX insert into foo values ('aaaa'); INSERT 0 1 alter table foo alter column id set data type varchar(20) collate "en_US.utf8"; ALTER TABLE ``` In YSQL, collation change can imply on-disk change due to collation-encoding: in docdb a character data is stored with a collation sort key. Different collation can have different sort key for the same character data. Currently YSQL only supports very limited alter column command when no on-disk change is possible. Given that, we also simply disallow column collation change. 2. Disallow creating database with any non-C collation. In my previous change, I accidently enabled create database command to create a database with a non-C collation. This isn't intended as we still assume default collation is C. In addition, any non-C collation can imply perf/storage cost. So at this time we should continue to disallow this. We can enhance it in the future if needed. For example, the following command continues to fail: ``` yugabyte=# create database db LC_COLLATE = "en_US.utf8" TEMPLATE template0; create database db LC_COLLATE = "en_US.utf8" TEMPLATE template0; ERROR: Value other than 'C' for lc_collate option is not yet supported LINE 1: create database db LC_COLLATE = "en_US.utf8" TEMPLATE templa... ^ HINT: Please report the issue on https://github.com/YugaByte/yugabyte-db/issues ``` 3. Disallow text_pattern_ops/bpchar_pattern_ops/varchar_pattern_ops in index creation unless the indexed column has "C" collation. For example, ``` create table foo(id char(10) collate "en_US.utf8"); CREATE TABLE create index foo_id_idx on foo(id bpchar_pattern_ops asc); ysqlsh:pat3.sql:14: ERROR: could not use operator class "bpchar_pattern_ops" with column collation "en_US.utf8" HINT: Use the COLLATE clause to set "C" collation explicitly. ``` The semantics of bpchar_pattern_ops is to create an index such that the index keys are no longer sorted according to the base table column collation "en_US.utf8". Instead it sorts the index keys as if the collation is "C". However postgres does not change the index column collation to "C". Instead, it relies upon bpchar_pattern_ops to select a custom comparator bttext_pattern_cmp to do the comparison. That's why currently in YSQL collation support YB will detect that the index column has "en_US.utf8" collation and the index keys will still be sorted according to "en_US.utf8" not "C" collation. Therefore it is disallowed and a hint is given to user to use a work around such as: ``` create index foo_id_idx on foo(id collate "C" asc); ``` Historically, postgres only supported database collation that is decided at initdb time via OS env variable LC_COLLATE. As a result, normal index will be sorted according to collation determined by LC_COLLATE. Such an index is not usable by certain operators such as LIKE. As a work around, postgres provided these *_pattern_ops as work around to build an index that ignores LC_COLLATE (i.e., use C collation for the index). Now that postgres supports column collation which can override the database collation, *_pattern_ops are no longer needed. 4. Changed QLValue string_value field from 'string' to 'bytes' to suppress a error message. A collation sort key is a null-terminated byte sequence (without any embedded \0 byte). As a result the collation encoded string may contain invalid UTF-8 characters. However it is set as a QLValue string_value field in place of the original string value which is UTF-8. Protobuf reports invalid UTF-8 as an ERROR even though the collation-encoded string is still sent across the wire without any loss. 5. Added upgrade support Test Plan: ./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressTypesString' ./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressExtension' ./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressPartitions' ./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressDml' ./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressPlpgsql' ./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressFeature' ./yb_build.sh release --java-test 'org.yb.pgsql.TestYsqlUpgrade' Reviewers: mihnea, alex, dmitry Reviewed By: alex, dmitry Subscribers: ksreenivasan, yql Differential Revision: https://phabricator.dev.yugabyte.com/D13363
yugabyte · Nov 3, 2021 · 1e0330b · 1e0330b
1 parent a9f463f
commit 1e0330b
Show file tree

Hide file tree

Showing 22 changed files with 2,821 additions and 103 deletions.
diff --git a/java/yb-pgsql/src/test/java/org/yb/pgsql/TestYsqlUpgrade.java b/java/yb-pgsql/src/test/java/org/yb/pgsql/TestYsqlUpgrade.java
@@ -1202,9 +1202,9 @@ private void assertMigrationsWorked(
         assertEquals(ver, migrationRow.getInt(0).intValue());
         assertEquals(0, migrationRow.getInt(1).intValue());
         assertTrue(migrationRow.getString(2).startsWith("V" + ver + "__"));
-        assertTrue("Expected migration timestamp to be at most 5 mins old!",
+        assertTrue("Expected migration timestamp to be at most 10 mins old!",
             migrationRow.getLong(3) != null &&
-                System.currentTimeMillis() - migrationRow.getLong(3) < 5 * 60 * 1000);
+                System.currentTimeMillis() - migrationRow.getLong(3) < 10 * 60 * 1000);
       }
     }
 
@@ -1421,8 +1421,8 @@ private List<Row> copyWithResolvedOids(
   }
 
   private void runMigrations() throws Exception {
-    // Set migrations timeout to 3 min, adjusted
-    long timeoutMs = BuildTypeUtil.adjustTimeout(180 * 1000);
+    // Set migrations timeout to 6 min, adjusted
+    long timeoutMs = BuildTypeUtil.adjustTimeout(360 * 1000);
     List<String> lines = runProcess(
         TestUtils.findBinary("yb-admin"),
         "--master_addresses",

diff --git a/src/postgres/src/backend/commands/dbcommands.c b/src/postgres/src/backend/commands/dbcommands.c
@@ -415,7 +415,8 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 							 "https://github.com/yugabyte/yugabyte-db/issues"),
 					 parser_errposition(pstate, dencoding->location)));
 
-		if (!YBIsCollationEnabled() && dcollate && dbcollate && strcmp(dbcollate, "C") != 0)
+		if (!(YBIsCollationEnabled() && kTestOnlyUseOSDefaultCollation) && dcollate &&
+			dbcollate && strcmp(dbcollate, "C") != 0)
 			ereport(YBUnsupportedFeatureSignalLevel(),
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("Value other than 'C' for lc_collate "

diff --git a/src/postgres/src/backend/commands/indexcmds.c b/src/postgres/src/backend/commands/indexcmds.c
@@ -25,6 +25,7 @@
 #include "catalog/indexing.h"
 #include "catalog/partition.h"
 #include "catalog/pg_am.h"
+#include "catalog/pg_collation.h"
 #include "catalog/pg_constraint.h"
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_opclass.h"
@@ -1477,6 +1478,46 @@ CheckPredicate(Expr *predicate)
 				 errmsg("functions in index predicate must be marked IMMUTABLE")));
 }
 
+/*
+ * YbCheckCollationRestrictions
+ *		Checks that the given partial-index predicate is valid.
+ * Disallow some built-in operator classes if the column has non-C collation.
+ * We already accept them if the column has C collation so continue to allow that.
+ */
+static void
+YbCheckCollationRestrictions(Oid attcollation, Oid opclassoid)
+{
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	char *opclassname;
+	HeapTuple collationtup;
+	Form_pg_collation collform;
+	char *collname;
+
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+	opclassname = NameStr(classform->opcname);
+	if (strcasecmp(opclassname, "bpchar_pattern_ops") == 0 ||
+		strcasecmp(opclassname, "text_pattern_ops") == 0 ||
+		strcasecmp(opclassname, "varchar_pattern_ops") == 0)
+	{
+		collationtup = SearchSysCache1(COLLOID, ObjectIdGetDatum(attcollation));
+		if (!HeapTupleIsValid(collationtup))
+			elog(ERROR, "cache lookup failed for collation %u", attcollation);
+		collform = (Form_pg_collation) GETSTRUCT(collationtup);
+		collname = NameStr(collform->collname);
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("could not use operator class \"%s\" with column collation \"%s\"",
+						opclassname, collname),
+				 errhint("Use the COLLATE clause to set \"C\" collation explicitly.")));
+		ReleaseSysCache(collationtup);
+	}
+	ReleaseSysCache(classtup);
+}
+
 /*
  * Compute per-index-column information, including indexed column numbers
  * or index expressions, opclasses, and indoptions. Note, all output vectors
@@ -1757,6 +1798,15 @@ ComputeIndexAttrs(IndexInfo *indexInfo,
 										 accessMethodName,
 										 accessMethodId);
 
+		/*
+	 	 * In Yugabyte mode, disallow some built-in operator classes if the column has non-C
+	 	 * collation.
+		 */
+		if (IsYugaByteEnabled() &&
+			YBIsCollationValidNonC(attcollation) &&
+			!kTestOnlyUseOSDefaultCollation)
+			YbCheckCollationRestrictions(attcollation, classOidP[attn]);
+
 		/*
 		 * Identify the exclusion operator, if any.
 		 */

diff --git a/src/postgres/src/backend/commands/ybccmds.c b/src/postgres/src/backend/commands/ybccmds.c
@@ -967,6 +967,15 @@ YBCPrepareAlterTableCmd(AlterTableCmd* cmd, Relation rel, YBCPgStatement handle,
 							errmsg("This ALTER TABLE command is not yet supported.")));
 				}
 			}
+			/*
+			 * Do not allow collation update because that requires different collation
+			 * encoding and therefore can cause on-disk changes.
+			 */
+			Oid cur_collation_id = attTup->attcollation;
+			Oid new_collation_id = GetColumnDefCollation(NULL, colDef, newTypId);
+			if (cur_collation_id != new_collation_id)
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("This ALTER TABLE command is not yet supported.")));
 			break;
 		}
 

diff --git a/src/postgres/src/common/pg_yb_common.c b/src/postgres/src/common/pg_yb_common.c
@@ -162,7 +162,7 @@ YBIsCollationEnabled()
 		 * The default value must be in sync with that of FLAGS_TEST_pg_collation_enabled.
 		 */
 		cached_value = YBCIsEnvVarTrueWithDefault("FLAGS_TEST_pg_collation_enabled",
-												  false /* default_value */);
+												  true /* default_value */);
 	}
 	return cached_value;
 #else

diff --git a/src/postgres/src/include/catalog/pg_yb_migration.dat b/src/postgres/src/include/catalog/pg_yb_migration.dat
@@ -11,6 +11,6 @@
 
 [
 
-{ major => '11', minor => '0', name => '<baseline>', time_applied => '_null_' }
+{ major => '12', minor => '0', name => '<baseline>', time_applied => '_null_' }
 
 ]
diff --git a/src/postgres/src/test/regress/expected/yb_extensions.out b/src/postgres/src/test/regress/expected/yb_extensions.out
@@ -57,30 +57,30 @@ select pg_stat_statements_reset();
 (1 row)
 
 select userid,dbid,queryid,query,calls,rows,shared_blks_hit,shared_blks_read,shared_blks_dirtied,shared_blks_written,local_blks_hit,local_blks_read,local_blks_dirtied,local_blks_written,temp_blks_read,temp_blks_written,blk_read_time from pg_stat_statements;
- userid | dbid  |      queryid       |               query               | calls | rows | shared_blks_hit | shared_blks_read | shared_blks_dirtied | shared_blks_written | local_blks_hit | local_blks_read | local_blks_dirtied | local_blks_written | temp_blks_read | temp_blks_written | blk_read_time
---------+-------+--------------------+-----------------------------------+-------+------+-----------------+------------------+---------------------+---------------------+----------------+-----------------+--------------------+--------------------+----------------+-------------------+---------------
-  12462 | 12463 | 304651465921150967 | select pg_stat_statements_reset() |     1 |    1 |               0 |                0 |                   0 |                   0 |              0 |               0 |                  0 |                  0 |              0 |                 0 |             0
+ userid | dbid  |       queryid       |               query               | calls | rows | shared_blks_hit | shared_blks_read | shared_blks_dirtied | shared_blks_written | local_blks_hit | local_blks_read | local_blks_dirtied | local_blks_written | temp_blks_read | temp_blks_written | blk_read_time
+--------+-------+---------------------+-----------------------------------+-------+------+-----------------+------------------+---------------------+---------------------+----------------+-----------------+--------------------+--------------------+----------------+-------------------+---------------
+  13247 | 13248 | 8978530904538012215 | select pg_stat_statements_reset() |     1 |    1 |               0 |                0 |                   0 |                   0 |              0 |               0 |                  0 |                  0 |              0 |                 0 |             0
 (1 row)
 
 create table test(a int, b float);
 insert into test(a,b) values (5,10);
 select userid,dbid,queryid,query,calls,rows,shared_blks_hit,shared_blks_read,shared_blks_dirtied,shared_blks_written,local_blks_hit,local_blks_read,local_blks_dirtied,local_blks_written,temp_blks_read,temp_blks_written,blk_read_time from pg_stat_statements;
  userid | dbid  |       queryid        |                                                                                                                              query                                                                                                                               | calls | rows | shared_blks_hit | shared_blks_read | shared_blks_dirtied | shared_blks_written | local_blks_hit | local_blks_read | local_blks_dirtied | local_blks_written | temp_blks_read | temp_blks_written | blk_read_time
 --------+-------+----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+------+-----------------+------------------+---------------------+---------------------+----------------+-----------------+--------------------+--------------------+----------------+-------------------+---------------
-  12462 | 12463 |   304651465921150967 | select pg_stat_statements_reset()                                                                                                                                                                                                                                |     1 |    1 |               0 |                0 |                   0 |                   0 |              0 |               0 |                  0 |                  0 |              0 |                 0 |             0
-  12462 | 12463 | -1399701522274914821 | insert into test(a,b) values ($1,$2)                                                                                                                                                                                                                             |     1 |    1 |               0 |                0 |                   0 |                   0 |              0 |               0 |                  0 |                  0 |              0 |                 0 |             0
-  12462 | 12463 | -6731041317835482454 | select userid,dbid,queryid,query,calls,rows,shared_blks_hit,shared_blks_read,shared_blks_dirtied,shared_blks_written,local_blks_hit,local_blks_read,local_blks_dirtied,local_blks_written,temp_blks_read,temp_blks_written,blk_read_time from pg_stat_statements |     1 |    1 |               0 |                0 |                   0 |                   0 |              0 |               0 |                  0 |                  0 |              0 |                 0 |             0
-  12462 | 12463 |  8223779431833792722 | create table test(a int, b float)                                                                                                                                                                                                                                |     1 |    0 |               0 |                0 |                   0 |                   0 |              0 |               0 |                  0 |                  0 |              0 |                 0 |             0
+  13247 | 13248 | -2113262558775307896 | select userid,dbid,queryid,query,calls,rows,shared_blks_hit,shared_blks_read,shared_blks_dirtied,shared_blks_written,local_blks_hit,local_blks_read,local_blks_dirtied,local_blks_written,temp_blks_read,temp_blks_written,blk_read_time from pg_stat_statements |     1 |    1 |               0 |                0 |                   0 |                   0 |              0 |               0 |                  0 |                  0 |              0 |                 0 |             0
+  13247 | 13248 | -1399701522274914821 | insert into test(a,b) values ($1,$2)                                                                                                                                                                                                                             |     1 |    1 |               0 |                0 |                   0 |                   0 |              0 |               0 |                  0 |                  0 |              0 |                 0 |             0
+  13247 | 13248 |  8978530904538012215 | select pg_stat_statements_reset()                                                                                                                                                                                                                                |     1 |    1 |               0 |                0 |                   0 |                   0 |              0 |               0 |                  0 |                  0 |              0 |                 0 |             0
+  13247 | 13248 |  8223779431833792722 | create table test(a int, b float)                                                                                                                                                                                                                                |     1 |    0 |               0 |                0 |                   0 |                   0 |              0 |               0 |                  0 |                  0 |              0 |                 0 |             0
 (4 rows)
 
 insert into test(a,b) values (15,20);
 select userid,dbid,queryid,query,calls,rows,shared_blks_hit,shared_blks_read,shared_blks_dirtied,shared_blks_written,local_blks_hit,local_blks_read,local_blks_dirtied,local_blks_written,temp_blks_read,temp_blks_written,blk_read_time from pg_stat_statements;
  userid | dbid  |       queryid        |                                                                                                                              query                                                                                                                               | calls | rows | shared_blks_hit | shared_blks_read | shared_blks_dirtied | shared_blks_written | local_blks_hit | local_blks_read | local_blks_dirtied | local_blks_written | temp_blks_read | temp_blks_written | blk_read_time
 --------+-------+----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+------+-----------------+------------------+---------------------+---------------------+----------------+-----------------+--------------------+--------------------+----------------+-------------------+---------------
-  12462 | 12463 |   304651465921150967 | select pg_stat_statements_reset()                                                                                                                                                                                                                                |     1 |    1 |               0 |                0 |                   0 |                   0 |              0 |               0 |                  0 |                  0 |              0 |                 0 |             0
-  12462 | 12463 | -1399701522274914821 | insert into test(a,b) values ($1,$2)                                                                                                                                                                                                                             |     2 |    2 |               0 |                0 |                   0 |                   0 |              0 |               0 |                  0 |                  0 |              0 |                 0 |             0
-  12462 | 12463 | -6731041317835482454 | select userid,dbid,queryid,query,calls,rows,shared_blks_hit,shared_blks_read,shared_blks_dirtied,shared_blks_written,local_blks_hit,local_blks_read,local_blks_dirtied,local_blks_written,temp_blks_read,temp_blks_written,blk_read_time from pg_stat_statements |     2 |    5 |               0 |                0 |                   0 |                   0 |              0 |               0 |                  0 |                  0 |              0 |                 0 |             0
-  12462 | 12463 |  8223779431833792722 | create table test(a int, b float)                                                                                                                                                                                                                                |     1 |    0 |               0 |                0 |                   0 |                   0 |              0 |               0 |                  0 |                  0 |              0 |                 0 |             0
+  13247 | 13248 | -2113262558775307896 | select userid,dbid,queryid,query,calls,rows,shared_blks_hit,shared_blks_read,shared_blks_dirtied,shared_blks_written,local_blks_hit,local_blks_read,local_blks_dirtied,local_blks_written,temp_blks_read,temp_blks_written,blk_read_time from pg_stat_statements |     2 |    5 |               0 |                0 |                   0 |                   0 |              0 |               0 |                  0 |                  0 |              0 |                 0 |             0
+  13247 | 13248 | -1399701522274914821 | insert into test(a,b) values ($1,$2)                                                                                                                                                                                                                             |     2 |    2 |               0 |                0 |                   0 |                   0 |              0 |               0 |                  0 |                  0 |              0 |                 0 |             0
+  13247 | 13248 |  8978530904538012215 | select pg_stat_statements_reset()                                                                                                                                                                                                                                |     1 |    1 |               0 |                0 |                   0 |                   0 |              0 |               0 |                  0 |                  0 |              0 |                 0 |             0
+  13247 | 13248 |  8223779431833792722 | create table test(a int, b float)                                                                                                                                                                                                                                |     1 |    0 |               0 |                0 |                   0 |                   0 |              0 |               0 |                  0 |                  0 |              0 |                 0 |             0
 (4 rows)
 
 drop table test;
@@ -97,4 +97,4 @@ select uuid_generate_v1() != uuid_generate_v4();
  ?column?
 ----------
  t
-(1 row)
+(1 row)
diff --git a/src/postgres/src/test/regress/expected/yb_feature_temp.out b/src/postgres/src/test/regress/expected/yb_feature_temp.out
@@ -117,15 +117,15 @@ SELECT * FROM temptest;
 COMMIT;
 ERROR:  Illegal state: Transaction for catalog table write operation 'pg_type' not found
 SELECT * FROM temptest;
-ERROR:  could not open file "base/12463/t2_16950": No such file or directory
+ERROR:  could not open file "base/13248/t2_16950": No such file or directory
 BEGIN;
 CREATE TEMP TABLE temptest(col) ON COMMIT DROP AS SELECT 1;
 ERROR:  relation "temptest" already exists
 SELECT * FROM temptest;
 ERROR:  current transaction is aborted, commands ignored until end of transaction block
 COMMIT;
 SELECT * FROM temptest;
-ERROR:  could not open file "base/12463/t2_16950": No such file or directory
+ERROR:  could not open file "base/13248/t2_16950": No such file or directory
 -- ON COMMIT is only allowed for TEMP
 CREATE TABLE temptest(col int) ON COMMIT DELETE ROWS;
 ERROR:  ON COMMIT can only be used on temporary tables
-Original file line number
+Diff line change
@@ Expand Up / @@ -11,6 +11,6 @@ @@
     [
-    { major => '11', minor => '0', name => '<baseline>', time_applied => '_null_' }
+    { major => '12', minor => '0', name => '<baseline>', time_applied => '_null_' }
     ]