fix: CI failed plpython and plpython3 #1

yihong0618 · 2022-03-23T09:10:27Z

Here are some reminders before you submit the pull request

Add tests for the change
Document changes
Communicate in the mailing list if needed
Pass make installcheck
Review a PR in return to support the community

This is based the PR https://github.com/greenplum-db/gpdb/pull/13095

fix the plpython Ci failed
based on the difference python2 and python3 created two files plpython_returns_2.out and plpython_returns_3.out for the different behavior of these
change the Makefile for support the plpython3 install_check
change the init_file for the uselesss diff

beeender

Please take a note of the reasons for python3 test case changes. We need to summarize it in our final commit message.

beeender · 2022-03-25T12:15:42Z

src/pl/plpython/Makefile

-# 	plpython_returns \
-# 	plpython_gpdb \
-# 	plpython_drop
+# REGRESS = plpython_record plpython_returns plpython_schema 


this line can be removed

beeender · 2022-03-25T12:21:25Z

src/pl/plpython/expected/plpython_returns_2.out

@@ -0,0 +1,3154 @@
+-- Tests Datatypes for Input Parameters and return values


why this is needed? I think the plpython_returns.out is for python2?

I think we don't need this file

beeender · 2022-03-25T12:23:07Z

src/pl/plpython/expected/plpython_returns_3.out

+(1 row)
+
+SELECT * FROM test_return_table_record('{"first": "value", "second": 4}, "third": "ignored"}');
+ERROR:  SyntaxError: invalid syntax (<string>, line 1) (plpy_elog.c:106)


This doesn't work on my computer. It should not be a big deal as long as it passes on our CI.

❯ python3 --version Python 3.10.2

--- \/home\/cc\/repo\/gpdb6\/src\/pl\/plpython3\/expected\/python3\/plpython_returns\.out 2022-03-25 20:19:38.405854360 +0800 +++ \/home\/cc\/repo\/gpdb6\/src\/pl\/plpython3\/results\/python3\/plpython_returns\.out 2022-03-25 20:19:38.424854218 +0800 @@ -1977,7 +1977,7 @@ (1 row) SELECT * FROM test_return_table_record('{"first": "value", "second": 4}, "third": "ignored"}'); -ERROR: SyntaxError: invalid syntax (<string>, line 1) +ERROR: SyntaxError: unmatched '}' (<string>, line 1) -- ===================================================== -- TEST 4: RETURN VALUE TESTING - SETOF Compound values -- ===================================================== @@ -2093,7 +2093,7 @@ SELECT test_return_setof_type_record('[{"first": "value", "second": 4, "third": "ignored"}]'); SELECT * FROM test_return_setof_type_record('[{"first": "value", "second": 4}, "third": "ignored"}]'); -ERROR: SyntaxError: invalid syntax (<string>, line 1) +ERROR: SyntaxError: closing parenthesis '}' does not match opening parenthesis '[' (<string>, line 1) -- -- Case 2: From a setof Table Type -- @@ -2200,7 +2200,7 @@ SELECT test_return_setof_table_record('[{"first": "value", "second": 4, "third": "ignored"}]'); SELECT * FROM test_return_setof_table_record('[{"first": "value", "second": 4}, "third": "ignored"}]'); -ERROR: SyntaxError: invalid syntax (<string>, line 1) +ERROR: SyntaxError: closing parenthesis '}' does not match opening parenthesis '[' (<string>, line 1) -- ===================================================== -- TEST 5: RETURN VALUE TESTING - OUT Parameters -- ===================================================== @@ -2896,7 +2896,7 @@ (1 row) SELECT * FROM test_return_out_setof_record('[{"first": "value", "second": 4}, "third": "ignored"}]'); -ERROR: SyntaxError: invalid syntax (<string>, line 1) +ERROR: SyntaxError: closing parenthesis '}' does not match opening parenthesis '[' (<string>, line 1) -- From Python List of String (one character per column) SELECT test_return_out_setof_record('["a4"]'); ERROR: malformed record literal: "a4" ======================================================================

beeender · 2022-03-25T12:24:47Z

src/pl/plpython3/Makefile

@@ -153,6 +153,8 @@ REGRESS := $(foreach test,$(REGRESS),$(if $(filter $(test),$(REGRESS_PLPYTHON3_M
 .PHONY: pgregress-python3-mangle
 pgregress-python3-mangle:
 	$(MKDIR_P) sql/python3 expected/python3 results/python3
+	# for python and python3 types and returns test are not the same.
+	cp -f ../plpython/expected/plpython_returns_3.out ../plpython/expected/plpython_returns.out


I think this line is not needed since we have line 174?
Plus this will create a git diff which is not ideal:

❯ git status On branch hy/plpython3 Your branch is up to date with 'beeender/hy/plpython3'. Changes not staged for commit: (use "git add <file>..." to update what will be committed) (use "git restore <file>..." to discard changes in working directory) modified: ../../backend/utils/error/elog.c modified: ../plpython/expected/plpython_returns.out

beeender · 2022-03-25T12:26:06Z

src/pl/plpython/Makefile

@@ -189,6 +191,7 @@ check: all submake
 	$(pg_regress_check) $(REGRESS_OPTS) $(REGRESS)

 installcheck: submake
+	cp expected/plpython_returns_2.out expected/plpython_returns.out


I don't think this is needed since python3's expected is always in expected/python3/plpython_types.out

…CREATE/ALTER resouce group. In some scenarios, the AccessExclusiveLock for table pg_resgroupcapability may cause database setup/recovery pending. Below is why we need change the AccessExclusiveLock to ExclusiveLock. This lock on table pg_resgroupcapability is used to concurrent update this table when run "Create/Alter resource group" statement. There is a CPU limit, after modify one resource group, it has to check if the whole CPU usage of all resource groups doesn't exceed 100%. Before this fix, AccessExclusiveLock is used. Suppose one user is running "Alter resource group" statement, QD will dispatch this statement to all QEs, so it is a two phase commit(2PC) transaction. When QD dispatched "Alter resource group" statement and QE acquire the AccessExclusiveLock for table pg_resgroupcapability. Until the 2PC distributed transaction committed, QE can release the AccessExclusiveLock for this table. In the second phase, QD will call function doNotifyingCommitPrepared to broadcast "commit prepared" command to all QEs, QE has already finish prepared, this transation is a prepared transaction. Suppose at this point, there is a primary segment down and a mirror will be promoted to primary. The mirror got the "promoted" message from coordinator, and will recover based on xlog from primary, in order to recover the prepared transaction, it will read the prepared transaction log entry and acquire AccessExclusiveLock for table pg_resgroupcapability. The callstack is: #0 lock_twophase_recover (xid=, info=, recdata=, len=) at lock.c:4697 #1 ProcessRecords (callbacks=, xid=2933, bufptr=0x1d575a8 "") at twophase.c:1757 #2 RecoverPreparedTransactions () at twophase.c:2214 #3 StartupXLOG () at xlog.c:8013 #4 StartupProcessMain () at startup.c:231 #5 AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7fff84b94a70) at bootstrap.c:459 #6 StartChildProcess (type=StartupProcess) at postmaster.c:5917 #7 PostmasterMain (argc=argc@entry=7, argv=argv@entry=0x1d555b0) at postmaster.c:1581 #8 main (argc=7, argv=0x1d555b0) at main.c:240 After that, the database instance will start up, all related initialization functions will be called. However, there is a function named "InitResGroups", it will acquire AccessShareLock for table pg_resgroupcapability and do some initialization stuff. The callstack is: #6 WaitOnLock (locallock=locallock@entry=0x1c7f248, owner=owner@entry=0x1ca0a40) at lock.c:1999 #7 LockAcquireExtended (locktag=locktag@entry=0x7ffd15d18d90, lockmode=lockmode@entry=1, sessionLock=sessionLock@entry=false, dontWait=dontWait@entry=false, reportMemoryError=reportMemoryError@entry=true, locallockp=locallockp@entry=0x7ffd15d18d88) at lock.c:1192 #8 LockRelationOid (relid=6439, lockmode=1) at lmgr.c:126 #9 relation_open (relationId=relationId@entry=6439, lockmode=lockmode@entry=1) at relation.c:56 #10 table_open (relationId=relationId@entry=6439, lockmode=lockmode@entry=1) at table.c:47 #11 InitResGroups () at resgroup.c:581 #12 InitResManager () at resource_manager.c:83 #13 initPostgres (in_dbname=, dboid=dboid@entry=0, username=username@entry=0x1c5b730 "linw", useroid=useroid@entry=0, out_dbname=out_dbname@entry=0x0, override_allow_connections=override_allow_connections@entry=false) at postinit.c:1284 #14 PostgresMain (argc=1, argv=argv@entry=0x1c8af78, dbname=0x1c89e70 "postgres", username=0x1c5b730 "linw") at postgres.c:4812 #15 BackendRun (port=, port=) at postmaster.c:4922 #16 BackendStartup (port=0x1c835d0) at postmaster.c:4607 #17 ServerLoop () at postmaster.c:1963 #18 PostmasterMain (argc=argc@entry=7, argv=argv@entry=0x1c595b0) at postmaster.c:1589 #19 in main (argc=7, argv=0x1c595b0) at main.c:240 The AccessExclusiveLock is not released, and it is not compatible with any other locks, so the startup process will be pending on this lock. So the mirror can't become primary successfully. Even users run "gprecoverseg" to recover the primary segment. the result is similar. The primary segment will recover from xlog, it will recover prepared transactions and acquire AccessExclusiveLock for table pg_resgroupcapability. Then the startup process is pending on this lock. Unless users change the resource type to "queue", the function InitResGroups will not be called, and won't be blocked, then the primary segment can startup normally. After this fix, ExclusiveLock is acquired when alter resource group. In above case, the startup process acquires AccessShareLock, ExclusiveLock and AccessShareLock are compatible. The startup process can run successfully. After startup, QE will get RECOVERY_COMMIT_PREPARED command from QD, it will finish the second phase of this distributed transaction and release ExclusiveLock on table pg_resgroupcapability. The callstack is: #0 lock_twophase_postcommit (xid=, info=, recdata=0x3303458, len=) at lock.c:4758 #1 ProcessRecords (callbacks=, xid=, bufptr=0x3303458 "") at twophase.c:1757 #2 FinishPreparedTransaction (gid=gid@entry=0x323caf5 "25", isCommit=isCommit@entry=true, raiseErrorIfNotFound=raiseErrorIfNotFound@entry=false) at twophase.c:1704 #3 in performDtxProtocolCommitPrepared (gid=gid@entry=0x323caf5 "25", raiseErrorIfNotFound=raiseErrorIfNotFound@entry=false) at cdbtm.c:2107 #4 performDtxProtocolCommand (dtxProtocolCommand=dtxProtocolCommand@entry=DTX_PROTOCOL_COMMAND_RECOVERY_COMMIT_PREPARED, gid=gid@entry=0x323caf5 "25", contextInfo=contextInfo@entry=0x10e1820 ) at cdbtm.c:2279 #5 exec_mpp_dtx_protocol_command (contextInfo=0x10e1820 , gid=0x323caf5 "25", loggingStr=0x323cad8 "Recovery Commit Prepared", dtxProtocolCommand=DTX_PROTOCOL_COMMAND_RECOVERY_COMMIT_PREPARED) at postgres.c:1570 #6 PostgresMain (argc=, argv=argv@entry=0x3268f98, dbname=0x3267e90 "postgres", username=) at postgres.c:5482 The test case of this commit simulates a repro of this bug.

…ce (#12447) Recently I built from GreenPlum master branch to run TPC-DS query with 1GB data. For Q47 and Q57, when I turned off GUC `execute_pruned_plan` (on by default), some of worker processes will be hang and the query never returns. Take Q57 as an example. My cluster configuration is 1 QD + 2 QE. The query looks like: ```sql with v1 as( select i_category,i_brand, cc_name,d_year,d_moy, sum(cs_sales_price) sum_sales, avg(sum(cs_sales_price)) over (partition by i_category,i_brand,cc_name,d_year) avg_monthly_sales, rank() over (partition by i_category,i_brand,cc_name order by d_year,d_moy ) rn from item,catalog_sales,date_dim,call_center where cs_item_sk = i_item_sk and cs_sold_date_sk = d_date_sk and cc_call_center_sk= cs_call_center_sk and( d_year = 1999 or ( d_year = 1999-1 and d_moy =12) or ( d_year = 1999+1 and d_moy =1) ) group by i_category,i_brand,cc_name,d_year,d_moy ), v2 as( select v1.i_category,v1.i_brand,v1.cc_name, v1.d_year,v1.d_moy,v1.avg_monthly_sales, v1.sum_sales,v1_lag.sum_sales psum, v1_lead.sum_sales nsum from v1,v1 v1_lag,v1 v1_lead where v1.i_category = v1_lag.i_category and v1.i_category = v1_lead.i_category and v1.i_brand = v1_lag.i_brand and v1.i_brand = v1_lead.i_brand and v1. cc_name = v1_lag. cc_name and v1. cc_name = v1_lead. cc_name and v1.rn = v1_lag.rn + 1 and v1.rn = v1_lead.rn - 1 ) select * from v2 where d_year = 1999 and avg_monthly_sales > 0 and case when avg_monthly_sales > 0 then abs(sum_sales - avg_monthly_sales) / avg_monthly_sales else null end > 0.1 order by sum_sales - avg_monthly_sales,3 limit 100; ``` When `execute_pruned_plan` is on by default, the plan looks like: ``` QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Result (cost=0.00..2832.84 rows=1 width=64) (actual time=10792.606..10792.702 rows=100 loops=1) -> Gather Motion 2:1 (slice1; segments: 2) (cost=0.00..2832.84 rows=1 width=64) (actual time=10792.597..10792.673 rows=100 loops=1) Merge Key: ((share0_ref4.sum_sales - share0_ref4.avg_monthly_sales)), share0_ref4.cc_name -> Sort (cost=0.00..2832.84 rows=1 width=72) (actual time=10791.203..10791.225 rows=50 loops=1) Sort Key: ((share0_ref4.sum_sales - share0_ref4.avg_monthly_sales)), share0_ref4.cc_name Sort Method: quicksort Memory: 152kB -> Sequence (cost=0.00..2832.84 rows=1 width=72) (actual time=10790.522..10790.559 rows=50 loops=1) -> Shared Scan (share slice:id 1:0) (cost=0.00..1539.83 rows=1 width=1) (actual time=10140.895..10145.397 rows=16510 loops=1) -> WindowAgg (cost=0.00..1539.83 rows=1 width=56) (actual time=10082.465..10128.750 rows=16510 loops=1) Partition By: item.i_category, item.i_brand, call_center.cc_name Order By: date_dim.d_year, date_dim.d_moy -> Sort (cost=0.00..1539.83 rows=1 width=48) (actual time=10082.429..10084.923 rows=16510 loops=1) Sort Key: item.i_category, item.i_brand, call_center.cc_name, date_dim.d_year, date_dim.d_moy Sort Method: quicksort Memory: 20078kB -> Redistribute Motion 2:2 (slice2; segments: 2) (cost=0.00..1539.83 rows=1 width=48) (actual time=9924.269..9989.657 rows=16510 loops=1) Hash Key: item.i_category, item.i_brand, call_center.cc_name -> WindowAgg (cost=0.00..1539.83 rows=1 width=48) (actual time=9924.717..9974.500 rows=16633 loops=1) Partition By: item.i_category, item.i_brand, call_center.cc_name, date_dim.d_year -> Sort (cost=0.00..1539.83 rows=1 width=126) (actual time=9924.662..9927.280 rows=16633 loops=1) Sort Key: item.i_category, item.i_brand, call_center.cc_name, date_dim.d_year Sort Method: quicksort Memory: 20076kB -> Redistribute Motion 2:2 (slice3; segments: 2) (cost=0.00..1539.83 rows=1 width=126) (actual time=9394.220..9856.375 rows=16633 loops=1) Hash Key: item.i_category, item.i_brand, call_center.cc_name, date_dim.d_year -> GroupAggregate (cost=0.00..1539.83 rows=1 width=126) (actual time=9391.783..9833.988 rows=16424 loops=1) Group Key: item.i_category, item.i_brand, call_center.cc_name, date_dim.d_year, date_dim.d_moy -> Sort (cost=0.00..1539.83 rows=1 width=124) (actual time=9397.448..9628.606 rows=174584 loops=1) Sort Key: item.i_category, item.i_brand, call_center.cc_name, date_dim.d_year, date_dim.d_moy Sort Method: external merge Disk: 134144kB -> Redistribute Motion 2:2 (slice4; segments: 2) (cost=0.00..1539.83 rows=1 width=124) (actual time=6107.447..8237.581 rows=174584 loops=1) Hash Key: item.i_category, item.i_brand, call_center.cc_name, date_dim.d_year, date_dim.d_moy -> Hash Join (cost=0.00..1539.83 rows=1 width=124) (actual time=6112.706..7088.349 rows=178669 loops=1) Hash Cond: (date_dim.d_date_sk = catalog_sales.cs_sold_date_sk) -> Seq Scan on date_dim (cost=0.00..436.38 rows=204 width=12) (actual time=10.656..17.972 rows=222 loops=1) Filter: ((d_year = 1999) OR ((d_year = 1998) AND (d_moy = 12)) OR ((d_year = 2000) AND (d_moy = 1))) Rows Removed by Filter: 36504 -> Hash (cost=1103.41..1103.41 rows=1 width=120) (actual time=6100.040..6100.040 rows=1430799 loops=1) Buckets: 16384 (originally 16384) Batches: 32 (originally 1) Memory Usage: 12493kB -> Broadcast Motion 2:2 (slice5; segments: 2) (cost=0.00..1103.41 rows=1 width=120) (actual time=1.802..5410.377 rows=1434428 loops=1) -> Nested Loop (cost=0.00..1103.40 rows=1 width=120) (actual time=1.632..5127.625 rows=718766 loops=1) Join Filter: true -> Redistribute Motion 2:2 (slice6; segments: 2) (cost=0.00..1097.40 rows=1 width=22) (actual time=1.564..362.958 rows=718766 loops=1) Hash Key: catalog_sales.cs_item_sk -> Hash Join (cost=0.00..1097.40 rows=1 width=22) (actual time=1.112..996.643 rows=717589 loops=1) Hash Cond: (catalog_sales.cs_call_center_sk = call_center.cc_call_center_sk) -> Seq Scan on catalog_sales (cost=0.00..509.10 rows=720774 width=18) (actual time=0.144..602.362 rows=721193 loops=1) -> Hash (cost=431.00..431.00 rows=1 width=12) (actual time=0.022..0.022 rows=6 loops=1) Buckets: 32768 Batches: 1 Memory Usage: 257kB -> Broadcast Motion 2:2 (slice7; segments: 2) (cost=0.00..431.00 rows=1 width=12) (actual time=0.009..0.012 rows=6 loops=1) -> Seq Scan on call_center (cost=0.00..431.00 rows=1 width=12) (actual time=0.032..0.035 rows=4 loops=1) -> Index Scan using item_pkey on item (cost=0.00..6.00 rows=1 width=102) (actual time=0.000..0.006 rows=1 loops=718766) Index Cond: (i_item_sk = catalog_sales.cs_item_sk) -> Redistribute Motion 1:2 (slice8) (cost=0.00..1293.01 rows=1 width=72) (actual time=646.614..646.646 rows=50 loops=1) -> Limit (cost=0.00..1293.01 rows=1 width=72) (actual time=10787.533..10787.700 rows=100 loops=1) -> Gather Motion 2:1 (slice9; segments: 2) (cost=0.00..1293.01 rows=1 width=72) (actual time=10787.527..10787.654 rows=100 loops=1) Merge Key: ((share0_ref4.sum_sales - share0_ref4.avg_monthly_sales)), share0_ref4.cc_name -> Sort (cost=0.00..1293.01 rows=1 width=72) (actual time=10789.933..10789.995 rows=357 loops=1) Sort Key: ((share0_ref4.sum_sales - share0_ref4.avg_monthly_sales)), share0_ref4.cc_name Sort Method: quicksort Memory: 14998kB -> Result (cost=0.00..1293.01 rows=1 width=150) (actual time=10648.280..10774.898 rows=12379 loops=1) Filter: ((share0_ref4.d_year = 1999) AND (share0_ref4.avg_monthly_sales > '0'::numeric) AND (CASE WHEN (share0_ref4.avg_monthly_sales > '0'::numeric) THEN (abs((share0_ref4.sum_sales - share0_ref4.avg_monthly_sales)) / share0_ref4.avg_monthly_sales) ELSE NULL::numeric END > 0.1)) -> Hash Join (cost=0.00..1293.01 rows=1 width=150) (actual time=10648.253..10740.262 rows=13582 loops=1) Hash Cond: ((share0_ref4.i_category = share0_ref3.i_category) AND (share0_ref4.i_brand = share0_ref3.i_brand) AND ((share0_ref4.cc_name)::text = (share0_ref3.cc_name)::text) AND (share0_ref4.rn = (share0_ref3.rn + 1)) AND (share0_ref4.rn = (share0_ref2.rn - 1))) -> Shared Scan (share slice:id 9:0) (cost=0.00..431.00 rows=1 width=142) (actual time=0.013..5.570 rows=16510 loops=1) -> Hash (cost=862.00..862.00 rows=1 width=142) (actual time=10647.380..10647.380 rows=209076 loops=1) Buckets: 65536 (originally 32768) Batches: 2 (originally 1) Memory Usage: 31389kB -> Hash Join (cost=0.00..862.00 rows=1 width=142) (actual time=10156.494..10374.421 rows=209076 loops=1) Hash Cond: ((share0_ref3.i_category = share0_ref2.i_category) AND (share0_ref3.i_brand = share0_ref2.i_brand) AND ((share0_ref3.cc_name)::text = (share0_ref2.cc_name)::text)) -> Shared Scan (share slice:id 9:0) (cost=0.00..431.00 rows=1 width=126) (actual time=0.009..6.887 rows=16510 loops=1) -> Hash (cost=431.00..431.00 rows=1 width=126) (actual time=10156.297..10156.298 rows=16178 loops=1) Buckets: 32768 Batches: 1 Memory Usage: 3144kB -> Shared Scan (share slice:id 9:0) (cost=0.00..431.00 rows=1 width=126) (actual time=10139.421..10144.473 rows=16510 loops=1) Planning Time: 1905.667 ms (slice0) Executor memory: 330K bytes. (slice1) Executor memory: 4750K bytes avg x 2 workers, 4968K bytes max (seg1). Work_mem: 4861K bytes max. (slice2) Executor memory: 4701K bytes avg x 2 workers, 4952K bytes max (seg0). Work_mem: 4894K bytes max. (slice3) Executor memory: 12428K bytes avg x 2 workers, 12428K bytes max (seg0). Work_mem: 12375K bytes max. * (slice4) Executor memory: 14021K bytes avg x 2 workers, 14021K bytes max (seg0). Work_mem: 12493K bytes max, 221759K bytes wanted. (slice5) Executor memory: 77K bytes avg x 2 workers, 77K bytes max (seg0). (slice6) Executor memory: 323K bytes avg x 2 workers, 323K bytes max (seg0). Work_mem: 257K bytes max. (slice7) Executor memory: 39K bytes avg x 2 workers, 39K bytes max (seg0). (slice8) Executor memory: 242K bytes (entry db). * (slice9) Executor memory: 35344K bytes avg x 2 workers, 35360K bytes max (seg1). Work_mem: 31389K bytes max, 37501K bytes wanted. Memory used: 128000kB Memory wanted: 3328681kB Optimizer: Pivotal Optimizer (GPORCA) Execution Time: 10856.507 ms (86 rows) Time: 12779.991 ms (00:12.780) ``` There is only one share slice in this query, one producer in slice 1, three consumers in slice 9. However, when I turned GUC off, the query never returns, and the process situation looks like: ``` postgres 22285 22255 0 03:03 pts/1 00:00:00 psql -p9221 postgres 22288 20912 3 03:03 ? 00:00:03 postgres: 9221, postgres tpcds [local] con150 cmd16 EXPLAIN postgres 22294 20939 0 03:03 ? 00:00:00 postgres: 9210, postgres tpcds 172.17.0.50(60732) con150 seg0 cmd17 slice1 MPPEXEC SELECT postgres 22295 20950 0 03:03 ? 00:00:00 postgres: 9211, postgres tpcds 172.17.0.50(36177) con150 seg1 cmd17 slice1 MPPEXEC SELECT postgres 22306 20939 5 03:03 ? 00:00:04 postgres: 9210, postgres tpcds 172.17.0.50(60742) con150 seg0 idle postgres 22307 20950 5 03:03 ? 00:00:04 postgres: 9211, postgres tpcds 172.17.0.50(36187) con150 seg1 idle postgres 22310 20939 11 03:03 ? 00:00:10 postgres: 9210, postgres tpcds 172.17.0.50(60745) con150 seg0 idle postgres 22311 20950 12 03:03 ? 00:00:11 postgres: 9211, postgres tpcds 172.17.0.50(36190) con150 seg1 idle postgres 22314 20939 5 03:03 ? 00:00:04 postgres: 9210, postgres tpcds 172.17.0.50(60748) con150 seg0 idle postgres 22315 20950 5 03:03 ? 00:00:04 postgres: 9211, postgres tpcds 172.17.0.50(36193) con150 seg1 idle postgres 22318 20939 1 03:03 ? 00:00:01 postgres: 9210, postgres tpcds 172.17.0.50(60750) con150 seg0 idle postgres 22319 20950 2 03:03 ? 00:00:01 postgres: 9211, postgres tpcds 172.17.0.50(36195) con150 seg1 idle postgres 22322 20912 0 03:03 ? 00:00:00 postgres: 9221, postgres tpcds [local] con150 seg-1 idle postgres 22324 20939 0 03:03 ? 00:00:00 postgres: 9210, postgres tpcds 172.17.0.50(60754) con150 seg0 idle postgres 22325 20950 0 03:03 ? 00:00:00 postgres: 9211, postgres tpcds 172.17.0.50(36199) con150 seg1 idle postgres 22348 20939 0 03:05 ? 00:00:00 postgres: 9210, postgres tpcds 172.17.0.50(45936) con150 seg0 idle postgres 22349 20950 0 03:05 ? 00:00:00 postgres: 9211, postgres tpcds 172.17.0.50(49614) con150 seg1 idle postgres 22352 20939 4 03:05 ? 00:00:00 postgres: 9210, postgres tpcds 172.17.0.50(45939) con150 seg0 idle postgres 22353 20950 4 03:05 ? 00:00:00 postgres: 9211, postgres tpcds 172.17.0.50(49617) con150 seg1 idle ``` According to my debugging, the stack of slice 1 processes looks like: ``` #0 0x00007fde606f94f3 in epoll_wait () from /lib64/libc.so.6 #1 0x0000000000d2eec1 in WaitEventSetWaitBlock (set=0x87d8fe0, cur_timeout=-1, occurred_events=0x7ffce695fe00, nevents=1) at latch.c:1081 #2 0x0000000000d2ed9a in WaitEventSetWait (set=0x87d8fe0, timeout=-1, occurred_events=0x7ffce695fe00, nevents=1, wait_event_info=0) at latch.c:1033 #3 0x0000000000d5987d in ConditionVariableSleep (cv=0x7fde540890b0, wait_event_info=0) at condition_variable.c:157 #4 0x0000000000b30a61 in shareinput_writer_waitdone (ref=0x87da950, nconsumers=1) at nodeShareInputScan.c:994 #5 0x0000000000b2fe89 in ExecEndShareInputScan (node=0x88c2ec0) at nodeShareInputScan.c:522 #6 0x0000000000ad63e8 in ExecEndNode (node=0x88c2ec0) at execProcnode.c:888 #7 0x0000000000b3237b in ExecEndSequence (node=0x88c2d80) at nodeSequence.c:132 #8 0x0000000000ad623f in ExecEndNode (node=0x88c2d80) at execProcnode.c:779 #9 0x0000000000b1772e in ExecEndSort (node=0x88c2658) at nodeSort.c:365 ``` That is to say, the producer is waiting for consumers to wake it up, while the consumers didn't. According to further debugging, I found a **squelch** is triggered on the *Gather Motion* node upstream of three ShareInputScan consumer nodes. In the squelch logic of ShareInputScan, the consumer will notify producer only if `ndone == nsharers`: ```c local_state->ndone++; if (local_state->ndone == local_state->nsharers) { shareinput_reader_notifydone(node->ref, sisc->nconsumers); local_state->closed = true; } ``` While `ndone` will be accumulated one by one consumer, `nsharers` is initialized in ExecInitNode. However, GUC `execute_pruned_plan` affects the root node where the Executor starts to call `ExecInitNode`: - `execute_pruned_plan` set to true: the initialization will start at the root node of slice 9, `nsharers` will be 3 - `execute_pruned_plan` set to false: the initialization will start at the root node of the whole plan tree, `nsharers` will be 4, then `ndone == nsharers` will never establish, because we only have three consumers, `ndone` will be 3 at most According to my understanding, the algorithm should work well no matter this GUC is set to true or false. So I add some conditions in the process of initialization of `nsharers`: to accumulate `nsharers` only when initializing consumer nodes of current slice. Then this algorithm should be working fine.

In commit a0684da we made a change to use the fixed TOAST_TUPLE_TARGET instead of custom toast_tuple_target that's calculated based on the the blocksize of the AO table for toasting AO table tuples. This caused an issue that some tuples that are well beyond the toast_tuple_target are not being toasted (because they're smaller than TOAST_TUPLE_TARGET). This is ok when the tuples can still fit into the AO table's blocksize. But if not, an error would occur. E.g.: postgres=# create table t (a int, b int[]) WITH (appendonly=true, blocksize=8192); CREATE TABLE postgres=# insert into t select 1, array_agg(x) from generate_series(1, 2030) x; ERROR: item too long (check #1): length 8160, maxBufferLen 8168 CONTEXT: Append-Only table 't' Therefore, we should revert those changes, including: 1. Still use the toast_tuple_target for toasting 'x' and 'e' columns for AO tables. There's also a GPDB_12_MERGE_FIXME in the original comment suggesting to use RelationGetToastTupleTarget for AO table (in order to unify the logic to calculate maxDataLen for heap and AO tuples, I suppose). But that's not possible currently because the calculated toast_tuple_target is stored in AppendOnlyInsertDesc and we cannot get it from the Relation struct. And it seems to me that we won't have a way to do that easily. Therefore, keep this FIXME comment being removed. 2. Still use the toast_tuple_target for toasting 'm' columns for AO tables. Also restore the FIXME comment suggesting that we should use a bigger target size for 'm' columns. This should be something that worth investigating more into. Commit a0684da also includes a change to use custom toast_tuple_target for heap tables, which should be a valid change. I think that fixed an oversight when merging the upstream commit c251336.

We used to rename index-backed constraints in the EXCHANGE PARTITION command in 6X. Now we don't. We've decided to keep that behavior in 7X after looking into the opposing arguments: Argument #1. It might cause problem during upgrade. - Firstly, we won't be using legacy syntax in the dump scripts so we just need to worry about the existing tables produced by EXCHANGE PARTITION. I.e. whether or not they can be restored correctly. - For upgrading from 6X->7X, since those tables already have matched constraint and index names with the table names, we should be OK. - For upgrading 7X->onward, since we implement EXCHANGE PARTIITON simply as a combination of upstream-syntax commands (see AtExecGPExchangePartition()), pg_upgrade should be able to handle them. We've verified that manually and the automated test should cover that too. In fact, this gives another point that we shouldn't do our own hacky things as part of EXCHANGE PARTITION which might confuse pg_upgrade. Argument #2. It might surprise the users and their existing workloads. - The indexed constraint names are all implicitly generated and shouldn't be directly used in most cases. - They are not the only thing that will appear changed. E.g. the normal indexes (e.g. B-tree ones) will be too. So the decision to change one should be made with changing all of them. More details see: https://docs.google.com/document/d/1enJdKYxkk5WFRV1WoqIgLgRxCGxOqI2nglJVE_Wglec

gpdb_get_master_data_dir should set master_data_dir variable with non-NULL pointer. However, there is codepath in this function that leads to NULL result. We need to check this case and finish gpmon process with error if any trouble. This case is really reproduced in our production (gdb) bt #0 __strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:65 #1 0x00007f18fc1de9ce in __GI___strdup (s=0x0) at strdup.c:41 #2 0x000055ff3885813d in getconfig () at gpmmon.c:1679 #3 main (argc=<optimized out>, argv=<optimized out>) at gpmmon.c:1358 (gdb) Quit

…586) My previous commit 8915cd0 caused coredump in some pipeline jobs. Example stack: ``` Core was generated by `postgres: 7000, ic proxy process Program terminated with signal SIGSEGV, Segmentation fault. #0 0x0000000000b46ec3 in pg_atomic_read_u32_impl (ptr=0x7f05a8c51104) at ../../../../src/include/port/atomics/generic.h:48 (gdb) bt #0 0x0000000000b46ec3 in pg_atomic_read_u32_impl (ptr=0x7f05a8c51104) at ../../../../src/include/port/atomics/generic.h:48 #1 pg_atomic_read_u32 (ptr=0x7f05a8c51104) at ../../../../src/include/port/atomics.h:247 #2 LWLockAttemptLock (mode=LW_EXCLUSIVE, lock=0x7f05a8c51100) at lwlock.c:751 #3 LWLockAcquire (lock=0x7f05a8c51100, mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:1188 #4 0x0000000000b32fff in ShmemInitStruct (name=name@entry=0x130e160 "", size=size@entry=4, foundPtr=foundPtr@entry=0x7ffcf94513bf) at shmem.c:412 #5 0x0000000000d6d18e in ic_proxy_server_main () at ic_proxy_main.c:545 #6 0x0000000000d6c219 in ICProxyMain (main_arg=<optimized out>) at ic_proxy_bgworker.c:36 #7 0x0000000000aa9caa in StartBackgroundWorker () at bgworker.c:955 #8 0x0000000000ab9407 in do_start_bgworker (rw=<optimized out>) at postmaster.c:6450 #9 maybe_start_bgworkers () at postmaster.c:6706 #10 0x0000000000abbc59 in ServerLoop () at postmaster.c:2095 #11 0x0000000000abd777 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x36e3650) at postmaster.c:1633 #12 0x00000000006e4764 in main (argc=5, argv=0x36e3650) at main.c:240 (gdb) p *ptr Cannot access memory at address 0x7f05a8c51104 ``` The root cause is I forgot to init SHM structure at CreateSharedMemoryAndSemaphores(). Fix it in this commit.

## Problem An error occurs in python lib when a plpython function is executed. After our analysis, in the user's cluster, a plpython UDF was running with the unstable network, and got a timeout error: `failed to acquire resources on one or more segments`. Then a plpython UDF was run in the same session, and the UDF failed with GC error. Here is the core dump: ``` 2023-11-24 10:15:18.945507 CST,,,p2705198,th2081832064,,,,0,,,seg-1,,,,,"LOG","00000","3rd party error log: #0 0x7f7c68b6d55b in frame_dealloc /home/cc/repo/cpython/Objects/frameobject.c:509:5 #1 0x7f7c68b5109d in gen_send_ex /home/cc/repo/cpython/Objects/genobject.c:108:9 #2 0x7f7c68af9ddd in PyIter_Next /home/cc/repo/cpython/Objects/abstract.c:3118:14 #3 0x7f7c78caa5c0 in PLy_exec_function /home/cc/repo/gpdb6/src/pl/plpython/plpy_exec.c:134:11 #4 0x7f7c78cb5ffb in plpython_call_handler /home/cc/repo/gpdb6/src/pl/plpython/plpy_main.c:387:13 #5 0x562f5e008bb5 in ExecMakeTableFunctionResult /home/cc/repo/gpdb6/src/backend/executor/execQual.c:2395:13 #6 0x562f5e0dddec in FunctionNext_guts /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:142:5 #7 0x562f5e0da094 in FunctionNext /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:350:11 #8 0x562f5e03d4b0 in ExecScanFetch /home/cc/repo/gpdb6/src/backend/executor/execScan.c:84:9 #9 0x562f5e03cd8f in ExecScan /home/cc/repo/gpdb6/src/backend/executor/execScan.c:154:10 #10 0x562f5e0da072 in ExecFunctionScan /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:380:9 #11 0x562f5e001a1c in ExecProcNode /home/cc/repo/gpdb6/src/backend/executor/execProcnode.c:1071:13 #12 0x562f5dfe6377 in ExecutePlan /home/cc/repo/gpdb6/src/backend/executor/execMain.c:3202:10 #13 0x562f5dfe5bf4 in standard_ExecutorRun /home/cc/repo/gpdb6/src/backend/executor/execMain.c:1171:5 #14 0x562f5dfe4877 in ExecutorRun /home/cc/repo/gpdb6/src/backend/executor/execMain.c:992:4 #15 0x562f5e857e69 in PortalRunSelect /home/cc/repo/gpdb6/src/backend/tcop/pquery.c:1164:4 #16 0x562f5e856d3f in PortalRun /home/cc/repo/gpdb6/src/backend/tcop/pquery.c:1005:18 #17 0x562f5e84607a in exec_simple_query /home/cc/repo/gpdb6/src/backend/tcop/postgres.c:1848:10 ``` ## Reproduce We can use a simple procedure to reproduce the above problem: - set timeout GUC: `gpconfig -c gp_segment_connect_timeout -v 5` and `gpstop -ari` - prepare function: ``` CREATE EXTENSION plpythonu; CREATE OR REPLACE FUNCTION test_func() RETURNS SETOF int AS $$ plpy.execute("select pg_backend_pid()") for i in range(0, 5): yield (i) $$ LANGUAGE plpythonu; ``` - exit from the current psql session. - stop the postmaster of segment: `gdb -p "the pid of segment postmaster"` - enter a psql session. - call `SELECT test_func();` and get error ``` gpadmin=# select test_func(); ERROR: function "test_func" error fetching next item from iterator (plpy_elog.c:121) DETAIL: Exception: failed to acquire resources on one or more segments CONTEXT: Traceback (most recent call last): PL/Python function "test_func" ``` - quit gdb and make postmaster runnable. - call `SELECT test_func();` again and get panic ``` gpadmin=# SELECT test_func(); server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. !> ``` ## Analysis - There is an SPI call in test_func(): `plpy.execute()`. - Then coordinator will start a subtransaction by PLy_spi_subtransaction_begin(); - Meanwhile, if the segment cannot receive the instruction from the coordinator, the subtransaction beginning procedure return fails. - BUT! The Python processor does not know whether an error happened and does not clean its environment. - Then the next plpython UDF in the same session will fail due to the wrong Python environment. ## Solution - Use try-catch to catch the exception caused by PLy_spi_subtransaction_begin() - set the python error indicator by PLy_spi_exception_set() backport from #16856 Co-authored-by: Chen Mulong <chenmulong@gmail.com>

An error occurs in python lib when a plpython function is executed. After our analysis, in the user's cluster, a plpython UDF was running with the unstable network, and got a timeout error: `failed to acquire resources on one or more segments`. Then a plpython UDF was run in the same session, and the UDF failed with GC error. Here is the core dump: ``` 2023-11-24 10:15:18.945507 CST,,,p2705198,th2081832064,,,,0,,,seg-1,,,,,"LOG","00000","3rd party error log: #0 0x7f7c68b6d55b in frame_dealloc /home/cc/repo/cpython/Objects/frameobject.c:509:5 #1 0x7f7c68b5109d in gen_send_ex /home/cc/repo/cpython/Objects/genobject.c:108:9 #2 0x7f7c68af9ddd in PyIter_Next /home/cc/repo/cpython/Objects/abstract.c:3118:14 #3 0x7f7c78caa5c0 in PLy_exec_function /home/cc/repo/gpdb6/src/pl/plpython/plpy_exec.c:134:11 #4 0x7f7c78cb5ffb in plpython_call_handler /home/cc/repo/gpdb6/src/pl/plpython/plpy_main.c:387:13 #5 0x562f5e008bb5 in ExecMakeTableFunctionResult /home/cc/repo/gpdb6/src/backend/executor/execQual.c:2395:13 #6 0x562f5e0dddec in FunctionNext_guts /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:142:5 #7 0x562f5e0da094 in FunctionNext /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:350:11 #8 0x562f5e03d4b0 in ExecScanFetch /home/cc/repo/gpdb6/src/backend/executor/execScan.c:84:9 #9 0x562f5e03cd8f in ExecScan /home/cc/repo/gpdb6/src/backend/executor/execScan.c:154:10 #10 0x562f5e0da072 in ExecFunctionScan /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:380:9 #11 0x562f5e001a1c in ExecProcNode /home/cc/repo/gpdb6/src/backend/executor/execProcnode.c:1071:13 #12 0x562f5dfe6377 in ExecutePlan /home/cc/repo/gpdb6/src/backend/executor/execMain.c:3202:10 #13 0x562f5dfe5bf4 in standard_ExecutorRun /home/cc/repo/gpdb6/src/backend/executor/execMain.c:1171:5 #14 0x562f5dfe4877 in ExecutorRun /home/cc/repo/gpdb6/src/backend/executor/execMain.c:992:4 #15 0x562f5e857e69 in PortalRunSelect /home/cc/repo/gpdb6/src/backend/tcop/pquery.c:1164:4 #16 0x562f5e856d3f in PortalRun /home/cc/repo/gpdb6/src/backend/tcop/pquery.c:1005:18 #17 0x562f5e84607a in exec_simple_query /home/cc/repo/gpdb6/src/backend/tcop/postgres.c:1848:10 ``` We can use a simple procedure to reproduce the above problem: - set timeout GUC: `gpconfig -c gp_segment_connect_timeout -v 5` and `gpstop -ari` - prepare function: ``` CREATE EXTENSION plpythonu; CREATE OR REPLACE FUNCTION test_func() RETURNS SETOF int AS $$ plpy.execute("select pg_backend_pid()") for i in range(0, 5): yield (i) $$ LANGUAGE plpythonu; ``` - exit from the current psql session. - stop the postmaster of segment: `gdb -p "the pid of segment postmaster"` - enter a psql session. - call `SELECT test_func();` and get error ``` gpadmin=# select test_func(); ERROR: function "test_func" error fetching next item from iterator (plpy_elog.c:121) DETAIL: Exception: failed to acquire resources on one or more segments CONTEXT: Traceback (most recent call last): PL/Python function "test_func" ``` - quit gdb and make postmaster runnable. - call `SELECT test_func();` again and get panic ``` gpadmin=# SELECT test_func(); server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. !> ``` - There is an SPI call in test_func(): `plpy.execute()`. - Then coordinator will start a subtransaction by PLy_spi_subtransaction_begin(); - Meanwhile, if the segment cannot receive the instruction from the coordinator, the subtransaction beginning procedure return fails. - BUT! The Python processor does not know whether an error happened and does not clean its environment. - Then the next plpython UDF in the same session will fail due to the wrong Python environment. - Use try-catch to catch the exception caused by PLy_spi_subtransaction_begin() - set the python error indicator by PLy_spi_exception_set() backport from #16856 Co-authored-by: Chen Mulong <chenmulong@gmail.com> (cherry picked from commit 45d6ba8)

## Problem An error occurs in python lib when a plpython function is executed. After our analysis, in the user's cluster, a plpython UDF was running with the unstable network, and got a timeout error: `failed to acquire resources on one or more segments`. Then a plpython UDF was run in the same session, and the UDF failed with GC error. Here is the core dump: ``` 2023-11-24 10:15:18.945507 CST,,,p2705198,th2081832064,,,,0,,,seg-1,,,,,"LOG","00000","3rd party error log: #0 0x7f7c68b6d55b in frame_dealloc /home/cc/repo/cpython/Objects/frameobject.c:509:5 #1 0x7f7c68b5109d in gen_send_ex /home/cc/repo/cpython/Objects/genobject.c:108:9 #2 0x7f7c68af9ddd in PyIter_Next /home/cc/repo/cpython/Objects/abstract.c:3118:14 #3 0x7f7c78caa5c0 in PLy_exec_function /home/cc/repo/gpdb6/src/pl/plpython/plpy_exec.c:134:11 #4 0x7f7c78cb5ffb in plpython_call_handler /home/cc/repo/gpdb6/src/pl/plpython/plpy_main.c:387:13 #5 0x562f5e008bb5 in ExecMakeTableFunctionResult /home/cc/repo/gpdb6/src/backend/executor/execQual.c:2395:13 #6 0x562f5e0dddec in FunctionNext_guts /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:142:5 #7 0x562f5e0da094 in FunctionNext /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:350:11 #8 0x562f5e03d4b0 in ExecScanFetch /home/cc/repo/gpdb6/src/backend/executor/execScan.c:84:9 #9 0x562f5e03cd8f in ExecScan /home/cc/repo/gpdb6/src/backend/executor/execScan.c:154:10 #10 0x562f5e0da072 in ExecFunctionScan /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:380:9 #11 0x562f5e001a1c in ExecProcNode /home/cc/repo/gpdb6/src/backend/executor/execProcnode.c:1071:13 #12 0x562f5dfe6377 in ExecutePlan /home/cc/repo/gpdb6/src/backend/executor/execMain.c:3202:10 #13 0x562f5dfe5bf4 in standard_ExecutorRun /home/cc/repo/gpdb6/src/backend/executor/execMain.c:1171:5 #14 0x562f5dfe4877 in ExecutorRun /home/cc/repo/gpdb6/src/backend/executor/execMain.c:992:4 #15 0x562f5e857e69 in PortalRunSelect /home/cc/repo/gpdb6/src/backend/tcop/pquery.c:1164:4 #16 0x562f5e856d3f in PortalRun /home/cc/repo/gpdb6/src/backend/tcop/pquery.c:1005:18 #17 0x562f5e84607a in exec_simple_query /home/cc/repo/gpdb6/src/backend/tcop/postgres.c:1848:10 ``` ## Reproduce We can use a simple procedure to reproduce the above problem: - set timeout GUC: `gpconfig -c gp_segment_connect_timeout -v 5` and `gpstop -ari` - prepare function: ``` CREATE EXTENSION plpythonu; CREATE OR REPLACE FUNCTION test_func() RETURNS SETOF int AS $$ plpy.execute("select pg_backend_pid()") for i in range(0, 5): yield (i) $$ LANGUAGE plpythonu; ``` - exit from the current psql session. - stop the postmaster of segment: `gdb -p "the pid of segment postmaster"` - enter a psql session. - call `SELECT test_func();` and get error ``` gpadmin=# select test_func(); ERROR: function "test_func" error fetching next item from iterator (plpy_elog.c:121) DETAIL: Exception: failed to acquire resources on one or more segments CONTEXT: Traceback (most recent call last): PL/Python function "test_func" ``` - quit gdb and make postmaster runnable. - call `SELECT test_func();` again and get panic ``` gpadmin=# SELECT test_func(); server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. !> ``` ## Analysis - There is an SPI call in test_func(): `plpy.execute()`. - Then coordinator will start a subtransaction by PLy_spi_subtransaction_begin(); - Meanwhile, if the segment cannot receive the instruction from the coordinator, the subtransaction beginning procedure return fails. - BUT! The Python processor does not know whether an error happened and does not clean its environment. - Then the next plpython UDF in the same session will fail due to the wrong Python environment. ## Solution - Use try-catch to catch the exception caused by PLy_spi_subtransaction_begin() - set the python error indicator by PLy_spi_exception_set() Co-authored-by: Chen Mulong <chenmulong@gmail.com>

An error occurs in python lib when a plpython function is executed. After our analysis, in the user's cluster, a plpython UDF was running with the unstable network, and got a timeout error: `failed to acquire resources on one or more segments`. Then a plpython UDF was run in the same session, and the UDF failed with GC error. Here is the core dump: ``` 2023-11-24 10:15:18.945507 CST,,,p2705198,th2081832064,,,,0,,,seg-1,,,,,"LOG","00000","3rd party error log: #0 0x7f7c68b6d55b in frame_dealloc /home/cc/repo/cpython/Objects/frameobject.c:509:5 #1 0x7f7c68b5109d in gen_send_ex /home/cc/repo/cpython/Objects/genobject.c:108:9 #2 0x7f7c68af9ddd in PyIter_Next /home/cc/repo/cpython/Objects/abstract.c:3118:14 #3 0x7f7c78caa5c0 in PLy_exec_function /home/cc/repo/gpdb6/src/pl/plpython/plpy_exec.c:134:11 #4 0x7f7c78cb5ffb in plpython_call_handler /home/cc/repo/gpdb6/src/pl/plpython/plpy_main.c:387:13 #5 0x562f5e008bb5 in ExecMakeTableFunctionResult /home/cc/repo/gpdb6/src/backend/executor/execQual.c:2395:13 #6 0x562f5e0dddec in FunctionNext_guts /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:142:5 #7 0x562f5e0da094 in FunctionNext /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:350:11 #8 0x562f5e03d4b0 in ExecScanFetch /home/cc/repo/gpdb6/src/backend/executor/execScan.c:84:9 #9 0x562f5e03cd8f in ExecScan /home/cc/repo/gpdb6/src/backend/executor/execScan.c:154:10 #10 0x562f5e0da072 in ExecFunctionScan /home/cc/repo/gpdb6/src/backend/executor/nodeFunctionscan.c:380:9 #11 0x562f5e001a1c in ExecProcNode /home/cc/repo/gpdb6/src/backend/executor/execProcnode.c:1071:13 #12 0x562f5dfe6377 in ExecutePlan /home/cc/repo/gpdb6/src/backend/executor/execMain.c:3202:10 #13 0x562f5dfe5bf4 in standard_ExecutorRun /home/cc/repo/gpdb6/src/backend/executor/execMain.c:1171:5 #14 0x562f5dfe4877 in ExecutorRun /home/cc/repo/gpdb6/src/backend/executor/execMain.c:992:4 #15 0x562f5e857e69 in PortalRunSelect /home/cc/repo/gpdb6/src/backend/tcop/pquery.c:1164:4 #16 0x562f5e856d3f in PortalRun /home/cc/repo/gpdb6/src/backend/tcop/pquery.c:1005:18 #17 0x562f5e84607a in exec_simple_query /home/cc/repo/gpdb6/src/backend/tcop/postgres.c:1848:10 ``` We can use a simple procedure to reproduce the above problem: - set timeout GUC: `gpconfig -c gp_segment_connect_timeout -v 5` and `gpstop -ari` - prepare function: ``` CREATE EXTENSION plpythonu; CREATE OR REPLACE FUNCTION test_func() RETURNS SETOF int AS $$ plpy.execute("select pg_backend_pid()") for i in range(0, 5): yield (i) $$ LANGUAGE plpythonu; ``` - exit from the current psql session. - stop the postmaster of segment: `gdb -p "the pid of segment postmaster"` - enter a psql session. - call `SELECT test_func();` and get error ``` gpadmin=# select test_func(); ERROR: function "test_func" error fetching next item from iterator (plpy_elog.c:121) DETAIL: Exception: failed to acquire resources on one or more segments CONTEXT: Traceback (most recent call last): PL/Python function "test_func" ``` - quit gdb and make postmaster runnable. - call `SELECT test_func();` again and get panic ``` gpadmin=# SELECT test_func(); server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. !> ``` - There is an SPI call in test_func(): `plpy.execute()`. - Then coordinator will start a subtransaction by PLy_spi_subtransaction_begin(); - Meanwhile, if the segment cannot receive the instruction from the coordinator, the subtransaction beginning procedure return fails. - BUT! The Python processor does not know whether an error happened and does not clean its environment. - Then the next plpython UDF in the same session will fail due to the wrong Python environment. - Use try-catch to catch the exception caused by PLy_spi_subtransaction_begin() - set the python error indicator by PLy_spi_exception_set() backport from #16856 Co-authored-by: Chen Mulong <chenmulong@gmail.com> (cherry picked from commit 45d6ba8) Co-authored-by: Zhang Hao <hzhang2@vmware.com>

yihong0618 added 3 commits March 15, 2022 13:01

fix: plpython2 test failed

1b88e41

fix: plpython3 test

db5a90c

fix: some typos

6740776

beeender reviewed Mar 25, 2022

View reviewed changes

fix: some bug by comments

cd9a4af

yihong0618 merged commit 8b94b9e into mc/plpython3 Mar 30, 2022

fix: no need the plpython_return_2 file

01b0ed5

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: CI failed plpython and plpython3 #1

fix: CI failed plpython and plpython3 #1

yihong0618 commented Mar 23, 2022

beeender left a comment

beeender Mar 25, 2022

beeender Mar 25, 2022

beeender Mar 30, 2022

beeender Mar 25, 2022

beeender Mar 25, 2022

beeender Mar 25, 2022

		@@ -0,0 +1,3154 @@
		-- Tests Datatypes for Input Parameters and return values

fix: CI failed plpython and plpython3 #1

fix: CI failed plpython and plpython3 #1

Conversation

yihong0618 commented Mar 23, 2022

Here are some reminders before you submit the pull request

beeender left a comment

Choose a reason for hiding this comment

beeender Mar 25, 2022

Choose a reason for hiding this comment

beeender Mar 25, 2022

Choose a reason for hiding this comment

beeender Mar 30, 2022

Choose a reason for hiding this comment

beeender Mar 25, 2022

Choose a reason for hiding this comment

beeender Mar 25, 2022

Choose a reason for hiding this comment

beeender Mar 25, 2022

Choose a reason for hiding this comment