panic: Non-nullable column with no value! #29374

Closed
leomkkwan opened this issue Aug 30, 2018 · 31 comments · Fixed by #30394

leomkkwan commented Aug 30, 2018

Running v2.1.0-beta.20180827 on a 5-node cluster. One of the nodes crashed with this panic message.

The column screenview:lastmodified was defined with a default value:

lastmodified TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now()

panic: Non-nullable column "screenview:lastmodified" with no value! Index scanned was "primary" with the index key columns (fp) and the values ('a6b301a0-1e78-4263-b351-e1347c99c2fa')

goroutine 160961941 [running]:
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*RowFetcher).finalizeRow(0xc42687fc90, 0x2efbe40, 0xc43a67a0c0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/rowfetcher.go:1276 +0x9d9
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*RowFetcher).NextRow(0xc42687fc90, 0x2efbe40, 0xc43a67a0c0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/rowfetcher.go:1028 +0x183
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*rowFetcherWrapper).Next(0xc430304ca0, 0xc42b1efb30, 0x0, 0x0, 0xc42b1efb30)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/tablereader.go:141 +0x4a
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*tableReader).Next(0xc42687f800, 0x0, 0x0, 0x0, 0xc42b1efb30)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/tablereader.go:238 +0x4e
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.Run(0x2efbe40, 0xc446d8aba0, 0x2f06300, 0xc42687f800, 0x2eda000, 0xc432698000)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/base.go:170 +0x35
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*ProcessorBase).Run(0xc42687f800, 0x2efbe40, 0xc446d8aba0, 0xc42be2ff38)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/processors.go:739 +0x98
created by github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).StartAsync
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:574 +0x1e3
jordanlewis added the C-bug, O-community, and S-1-stability labels Aug 30, 2018
@jordanlewis (Member)

@leomkkwan did you add that column as part of a schema change, or was it there from the first CREATE TABLE statement?

leomkkwan (Author) commented Aug 30, 2018

I have had that column since the first CREATE TABLE statement; however, when the panic occurred, I was adding an index to another table in the same database. Not sure if that is related.

That ALTER TABLE job got killed by the system and a rollback task was created automatically, but that rollback seems to be stuck. I will create another ticket if the rollback job is really stuck.

@vivekmenezes (Contributor)

@leomkkwan thanks for reporting this problem. Can you share with us the output of SHOW JOBS?
How did you get data into the table? Did you use IMPORT or any other such bulk import tool? Thanks!

@leomkkwan (Author)

SHOW JOBS output:

374446927381528578 | SCHEMA CHANGE | CREATE INDEX ON filu.public.t (aid) STORING (vid) | leo | canceled | 2018-08-15 14:15:33.591238+00:00 | 2018-08-15 14:15:33.70979+00:00 | 2018-08-28 14:21:32.679021+00:00 | 2018-08-28 14:21:32.679022+00:00 | 0.0788539797067642 | | NULL

374470972645703682 | SCHEMA CHANGE | CREATE INDEX ON filu.public.t (aid) STORING (vid) | leo | canceled | 2018-08-15 16:17:51.627163+00:00 | NULL | 2018-08-15 17:27:22.350865+00:00 | 2018-08-17 08:48:47.430118+00:00 | 0 | | NULL

374834006255828996 | SCHEMA CHANGE | ROLL BACK CREATE INDEX ON filu.public.t (aid) STORING (vid) | leo | succeeded | 2018-08-16 23:04:20.698192+00:00 | 2018-08-16 23:04:20.722213+00:00 | 2018-08-17 08:45:12.546743+00:00 | 2018-08-17 08:45:12.546744+00:00 | 1 | | NULL

374948913002020868 | SCHEMA CHANGE | ROLL BACK CREATE INDEX ON filu.public.t (aid) STORING (vid) | leo | succeeded | 2018-08-17 08:48:47.454237+00:00 | 2018-08-17 08:48:47.478727+00:00 | 2018-08-17 08:53:34.756627+00:00 | 2018-08-17 08:53:34.756628+00:00 | 1 | | NULL

378131188709654529 | SCHEMA CHANGE | CREATE INDEX ON filu.public.t (aid) STORING (vid) | root | failed | 2018-08-28 14:34:40.922791+00:00 | 2018-08-28 14:34:41.020696+00:00 | 2018-08-30 18:14:18.347473+00:00 | 2018-08-30 18:14:18.347474+00:00 | 0.155584752559662 | [n1] communication error: rpc error: code = Canceled desc = context canceled | NULL

378740599575347201 | SCHEMA CHANGE | ROLL BACK CREATE INDEX ON filu.public.t (aid) STORING (vid) | root | succeeded | 2018-08-30 18:14:18.314705+00:00 | 2018-08-30 18:14:18.412797+00:00 | 2018-08-31 12:09:41.857125+00:00 | 2018-08-31 12:09:41.857127+00:00 | 1 | | NULL

The first four rows are from a previous alpha version, which I mentioned in issue #28784.

I am not importing anything. I use CRDB as a log store for a web server, so there are a lot of concurrent single-row INSERT statements against that table. The statement looks like this:

insert into screenview (fp,tp,pp,landingurl,agent,ip,referurl,payload) values (@fp,@tp,@pp,@landingurl,@agent,@ip,@referurl,@payload) RETURNING NOTHING;


On a side note, the panic happened only once; however, a random node will go OOM and get killed if I run the above test.

vivekmenezes (Contributor) commented Aug 31, 2018

The CREATE INDEX schema change failure you saw has been fixed through #28769.

However, what you are seeing is a panic, which shouldn't be happening.

So the evidence suggests that this has nothing to do with IMPORT or schema changes, because the schema changes involved were CREATE INDEX and were not modifying any columns.

Thanks for reporting this issue; it requires further investigation. I'll continue looking into it.

@vivekmenezes (Contributor)

@leomkkwan can you run SELECT * FROM screenview WHERE fp = 'a6b301a0-1e78-4263-b351-e1347c99c2fa'? That will help establish whether the stored value is indeed NULL. Thanks!

leomkkwan (Author) commented Sep 4, 2018

Here is the output

fp tp pp date lastmodified payload

a6b301a0-1e78-4263-b351-e1347c99c2fa	a6b301a0-1e78-4263-b351-e1347c99c2fa	a6b301a0-1e78-4263-b351-e1347c99c2fa	2018-08-30 14:11:21	2018-08-30 14:11:21	{"param1": "1", "param2": "2"}

The row was inserted successfully.

@vivekmenezes (Contributor)

@leomkkwan this seems to indicate that the INSERT succeeded correctly and some SELECT request crashed with the above panic. Do you know what request this might be? Usually the request that panicked appears in the logs. Thanks!

@leomkkwan (Author)

@vivekmenezes May I email you the log file so you can take a look?

@vivekmenezes (Contributor)

My email is vivek@cockroachlabs.com. Please do not send me any sensitive information. Thanks!

@leomkkwan (Author)

Hi Vivek, I've sent you the log file; it doesn't contain any sensitive information.
Please take a look and see whether those "retrying adding checkpoint" messages from my CREATE INDEX process relate to this issue.
Thanks.

@vivekmenezes (Contributor)

I180830 18:13:33.371469 537 server/status/runtime.go:433  [n4] runtime stats: 10 GiB RSS, 360 goroutines, 238 MiB/154 MiB/562 MiB GO alloc/idle/total, 7.9 GiB/9.9 GiB CGO alloc/total, 199311.67cgo/sec, 1.04/0.05 %(u/s)time, 0.00 %gc (3x)
I180830 18:13:34.451296 539 server/status/recorder.go:652  [n4,summaries] available memory from cgroups (8.0 EiB) exceeds system memory 30 GiB, using system memory
I180830 18:13:36.358093 160915042 sql/distsqlrun/backfiller.go:226  [n4,IndexBackfiller=90] retrying adding checkpoint /Table/90/1/190{744971/16064/0-863091/1491} to table t_demographic_survey
I180830 18:13:43.374066 537 server/status/runtime.go:433  [n4] runtime stats: 10 GiB RSS, 345 goroutines, 217 MiB/176 MiB/576 MiB GO alloc/idle/total, 7.9 GiB/9.9 GiB CGO alloc/total, 66491.83cgo/sec, 0.73/0.06 %(u/s)time, 0.00 %gc (2x)
I180830 18:13:44.451265 539 server/status/recorder.go:652  [n4,summaries] available memory from cgroups (8.0 EiB) exceeds system memory 30 GiB, using system memory
I180830 18:13:53.372411 537 server/status/runtime.go:433  [n4] runtime stats: 10 GiB RSS, 293 goroutines, 214 MiB/181 MiB/577 MiB GO alloc/idle/total, 7.9 GiB/9.9 GiB CGO alloc/total, 174182.79cgo/sec, 0.92/0.05 %(u/s)time, 0.00 %gc (2x)
I180830 18:13:54.451300 539 server/status/recorder.go:652  [n4,summaries] available memory from cgroups (8.0 EiB) exceeds system memory 30 GiB, using system memory
I180830 18:13:57.312758 160933402 sql/distsqlrun/backfiller.go:226  [n4,IndexBackfiller=90] retrying adding checkpoint /Table/90/1/190{749311/14696/0-863091/1491} to table t_demographic_survey
I180830 18:14:03.372848 537 server/status/runtime.go:433  [n4] runtime stats: 10 GiB RSS, 273 goroutines, 183 MiB/201 MiB/580 MiB GO alloc/idle/total, 7.9 GiB/9.9 GiB CGO alloc/total, 2959.47cgo/sec, 0.60/0.06 %(u/s)time, 0.00 %gc (2x)
I180830 18:14:04.451217 539 server/status/recorder.go:652  [n4,summaries] available memory from cgroups (8.0 EiB) exceeds system memory 30 GiB, using system memory
I180830 18:14:07.663524 160942899 sql/distsqlrun/backfiller.go:226  [n4,IndexBackfiller=90] retrying adding checkpoint /Table/90/1/190{751541/14676/0-863091/1491} to table t_demographic_survey
I180830 18:14:07.779057 160942899 sql/distsqlrun/backfiller.go:226  [n4,IndexBackfiller=90] retrying adding checkpoint /Table/90/1/190{751541/14676/0-863091/1491} to table t_demographic_survey
I180830 18:14:13.372046 537 server/status/runtime.go:433  [n4] runtime stats: 10 GiB RSS, 262 goroutines, 331 MiB/120 MiB/580 MiB GO alloc/idle/total, 7.9 GiB/9.9 GiB CGO alloc/total, 90983.63cgo/sec, 0.73/0.06 %(u/s)time, 0.00 %gc (1x)
I180830 18:14:14.451285 539 server/status/recorder.go:652  [n4,summaries] available memory from cgroups (8.0 EiB) exceeds system memory 30 GiB, using system memory
panic: Non-nullable column "screenview:lastmidifed" with no value! Index scanned was "primary" with the index key columns (fp) and the values ('a6b301a0-1e78-4263-b351-e1347c99c2fa')

goroutine 160961941 [running]:
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*RowFetcher).finalizeRow(0xc42687fc90, 0x2efbe40, 0xc43a67a0c0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/rowfetcher.go:1276 +0x9d9
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*RowFetcher).NextRow(0xc42687fc90, 0x2efbe40, 0xc43a67a0c0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/rowfetcher.go:1028 +0x183
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*rowFetcherWrapper).Next(0xc430304ca0, 0xc42b1efb30, 0x0, 0x0, 0xc42b1efb30)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/tablereader.go:141 +0x4a
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*tableReader).Next(0xc42687f800, 0x0, 0x0, 0x0, 0xc42b1efb30)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/tablereader.go:238 +0x4e
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.Run(0x2efbe40, 0xc446d8aba0, 0x2f06300, 0xc42687f800, 0x2eda000, 0xc432698000)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/base.go:170 +0x35
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*ProcessorBase).Run(0xc42687f800, 0x2efbe40, 0xc446d8aba0, 0xc42be2ff38)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/processors.go:739 +0x98
created by github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).StartAsync
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:574 +0x1e3

I'm now betting that this crash is happening in the index backfiller. The index backfiller runs through a scan of the table, and it hits this panic. Normally, panics caused by statements result in the log showing an anonymized version of the statement.

@jordanlewis (Member)

I don't think it's the index backfiller. You'd see that in the stack trace.

The reason that there's no statement printed is that this is a flow on a remote node. You can tell because the bottom of the stack trace is from StartAsync.

leomkkwan (Author) commented Sep 9, 2018

The server has been upgraded to v2.1.0-beta.20180904.

I can replicate this panic by running the following query while the concurrent inserts are happening:

select * from universe.screenview where pp= '129d5825-f5b9-4535-bcad-5aaf2ad85bba'

The value '129d5825-f5b9-4535-bcad-5aaf2ad85bba' exists, and the panic message is for another row that is currently being inserted.

panic: Non-nullable column "screenview:lastmodified" with no value! Index scanned was "primary" with the index key columns (fp) and the values ('563e10ba-1a4c-4416-81aa-dcf9dcbb6662')

@vivekmenezes (Contributor)

@leomkkwan the INSERT should not be running through the code path that is hitting this panic, nor would the SELECT. Are there any other queries being run on this system? Perhaps the Statements page in the admin UI would provide answers? Thanks for reporting this issue.

@jordanlewis (Member)

Vivek, I do think the SELECT would run through this code path. I can show you more on Monday.

@leomkkwan (Author)

There were only two tasks running when this happened: INSERTs from an npgsql C# client and a SELECT statement from a DBeaver client.

@vivekmenezes (Contributor)

@leomkkwan can you share the schema with us using SHOW CREATE TABLE screenview? I'd like to reproduce this on our end.

You can confirm that select * from universe.screenview where fp = '563e10ba-1a4c-4416-81aa-dcf9dcbb6662' is returning a lastmodified value right?

It would also be nice to see the output of EXPLAIN select * from universe.screenview where pp= '129d5825-f5b9-4535-bcad-5aaf2ad85bba'

Thanks!

@leomkkwan (Author)

CREATE TABLE screenview (
    fp UUID NOT NULL DEFAULT gen_random_uuid(),
    tp UUID NOT NULL,
    pp UUID NOT NULL,
    nfp UUID NULL,
    pfp UUID NULL,
    date TIMESTAMPTZ NOT NULL DEFAULT now(),
    lastmodified TIMESTAMPTZ NOT NULL DEFAULT now(),
    ip INET NULL,
    landingurl STRING NULL,
    referurl STRING NULL,
    agent STRING NULL,
    sa0 STRING NULL,
    sa1 STRING NULL,
    sa2 STRING NULL,
    sa3 STRING NULL,
    sa4 STRING NULL,
    sa5 STRING NULL,
    sa6 STRING NULL,
    sa7 STRING NULL,
    sa8 STRING NULL,
    sa9 STRING NULL,
    payload JSONB NULL,
    CONSTRAINT "primary" PRIMARY KEY (fp ASC),
    FAMILY "primary" (fp, tp, pp, date, ip, landingurl, referurl, agent, sa0, sa1, sa2, sa3, sa4, sa5, sa6, sa7, sa8, sa9, payload),
    FAMILY f1 (nfp, pfp, lastmodified)
)

Confirmed: select * from universe.screenview where fp = '563e10ba-1a4c-4416-81aa-dcf9dcbb6662' returns a lastmodified value.

EXPLAIN select * from universe.screenview where pp= '129d5825-f5b9-4535-bcad-5aaf2ad85bba'

scan
  table  screenview@primary
  spans  ALL

A side note: the above query filters on the pp field, which is not the primary key.

@solongordon (Contributor)

Actually, it looks like this is simply a case of TableReader not draining when it encounters an error from the RowFetcher. This causes it to keep calling Next on the RowFetcher, which is now in a bad state, generating errors in a tight loop. I think this got broken when stats were added to the tablereader early in this release cycle.

I'll fix this and also do an audit for other places where processors might not be draining when a downstream error is encountered.
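
For context, here is a minimal, self-contained Go sketch of the drain-on-error idea. The types (row, metadata, fetcher, tableReader) are invented stand-ins, not CockroachDB's actual RowFetcher/ProcessorBase APIs; the point is only that once the source returns an error, the processor should switch to a draining state rather than keep calling the source in a loop.

package main

import (
	"errors"
	"fmt"
)

// row and metadata are stand-ins for distsql's EncDatumRow and ProducerMetadata.
type row []int
type metadata struct{ err error }

// fetcher is a stand-in for RowFetcher: it yields one row, then fails, and
// once it has failed it keeps returning errors on every subsequent call.
type fetcher struct{ calls int }

func (f *fetcher) nextRow() (row, error) {
	f.calls++
	if f.calls == 1 {
		return row{1}, nil
	}
	return nil, errors.New("row fetcher is in a bad state")
}

// tableReader models a processor. The draining flag is the fix: without it,
// next would call nextRow again after the first error and emit error
// metadata forever, in a tight loop.
type tableReader struct {
	f        fetcher
	draining bool
}

func (tr *tableReader) next() (row, *metadata) {
	if tr.draining {
		return nil, nil // drained: emit nothing further
	}
	r, err := tr.f.nextRow()
	if err != nil {
		tr.draining = true // analogous to moving to the draining state on error
		return nil, &metadata{err: err}
	}
	return r, nil
}

func main() {
	tr := &tableReader{}
	for {
		r, meta := tr.next()
		switch {
		case meta != nil:
			fmt.Println("error metadata:", meta.err)
		case r != nil:
			fmt.Println("row:", r)
		default:
			fmt.Println("done")
			return
		}
	}
}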

@jordanlewis (Member)

Hmm, it occurs to me that I don't directly understand how the missing drain call could cause this panic - do you? If you call NextRow after this error condition, can you see how it could hit the panic?

solongordon added a commit to solongordon/cockroach that referenced this issue Sep 12, 2018
I found several places where processors were not checking for error
metadata from their inputs and draining appropriately. The most
egregious of these is in TableReader, which was not calling
MoveToDraining when it encountered an error from the underlying
RowFetcher. This meant that rather than draining it would continue
calling NextRow and generating errors in a tight loop, which caused the
query to hang as well as high CPU and memory usage on the affected node.

Other affected processors are Distinct, SortChunks, and SortTopK.

Fixes cockroachdb#29374
Fixes cockroachdb#29978

Release note: None
craig bot pushed a commit that referenced this issue Sep 12, 2018
30096: distsql: add missing MoveToDraining calls r=solongordon a=solongordon

I found several places where processors were not checking for error
metadata from their inputs and draining appropriately. The most
egregious of these is in TableReader, which was not calling
MoveToDraining when it encountered an error from the underlying
RowFetcher. This meant that rather than draining it would continue
calling NextRow and generating errors in a tight loop, which caused the
query to hang as well as high CPU and memory usage on the affected node.

Other affected processors are Distinct, SortChunks, and SortTopK.

Fixes #29374
Fixes #29978

Release note: None

Co-authored-by: Solon Gordon <solon@cockroachlabs.com>
craig bot closed this as completed in #30096 Sep 12, 2018
@solongordon (Contributor)

@jordanlewis Yeah, I have not had any luck reproducing the panic. It seems possible that continuing to call NextRow after an error could trigger the panic. Something similar happened in my repro, where initially the ReadWithinUncertaintyIntervalError was thrown but a different error (single entry value with no default column id) was thrown on subsequent NextRow calls.

@leomkkwan, would you be willing to retest with my fix to see if the panic persists?

solongordon reopened this Sep 12, 2018
@leomkkwan (Author)

@solongordon Do you have a Linux binary that includes your fix so I can test?

@solongordon (Contributor)

Yes, you can get a bleeding edge binary via https://edge-binaries.cockroachdb.com/cockroach/cockroach.linux-gnu-amd64.LATEST

@leomkkwan (Author)

I'm testing with the latest binary. The server did not panic, but the first few tries returned the following error. After about a minute, everything looks good.

I will keep testing this, but I just wanted to let you know what I found.

SQL Error [40001]: ERROR: restart transaction: HandledRetryableTxnError: ReadWithinUncertaintyIntervalError: read at time 1536854492.069197097,0 encountered previous write with future timestamp 1536854492.148524676,0 within uncertainty interval t <= 1536854492.387521394,0; observed timestamps: [{2 1536854492.069197097,0} {3 1536854493.137278554,0} {4 1536854492.387521394,0} {5 1536854492.070576462,0}]
ERROR: restart transaction: HandledRetryableTxnError: ReadWithinUncertaintyIntervalError: read at time 1536854492.069197097,0 encountered previous write with future timestamp 1536854492.148524676,0 within uncertainty interval t <= 1536854492.387521394,0; observed timestamps: [{2 1536854492.069197097,0} {3 1536854493.137278554,0} {4 1536854492.387521394,0} {5 1536854492.070576462,0}]
ERROR: restart transaction: HandledRetryableTxnError: ReadWithinUncertaintyIntervalError: read at time 1536854492.069197097,0 encountered previous write with future timestamp 1536854492.148524676,0 within uncertainty interval t <= 1536854492.387521394,0; observed timestamps: [{2 1536854492.069197097,0} {3 1536854493.137278554,0} {4 1536854492.387521394,0} {5 1536854492.070576462,0}]
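
These 40001 "restart transaction" errors are retryable by design, so the client is expected to retry the transaction. As a rough sketch only (assuming a Go client using database/sql with the lib/pq driver; the connection string, column choices, and string-based error check are illustrative, not taken from this thread's C# or DBeaver clients):

package main

import (
	"database/sql"
	"fmt"
	"strings"
	"time"

	_ "github.com/lib/pq" // hypothetical choice of Postgres-wire driver
)

// runWithRetry retries fn while the server reports a retryable
// "restart transaction" error (SQLSTATE 40001).
func runWithRetry(db *sql.DB, fn func(*sql.Tx) error) error {
	for {
		tx, err := db.Begin()
		if err != nil {
			return err
		}
		err = fn(tx)
		if err == nil {
			if err = tx.Commit(); err == nil {
				return nil
			}
		}
		_ = tx.Rollback()
		// Crude check; a real client would inspect the SQLSTATE code instead.
		if strings.Contains(err.Error(), "restart transaction") {
			continue
		}
		return err
	}
}

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/universe?sslmode=disable") // illustrative DSN
	if err != nil {
		panic(err)
	}
	defer db.Close()

	err = runWithRetry(db, func(tx *sql.Tx) error {
		rows, err := tx.Query(
			`SELECT fp, lastmodified FROM screenview WHERE pp = $1`,
			"129d5825-f5b9-4535-bcad-5aaf2ad85bba")
		if err != nil {
			return err
		}
		defer rows.Close()
		for rows.Next() {
			var fp string
			var lastmodified time.Time
			if err := rows.Scan(&fp, &lastmodified); err != nil {
				return err
			}
			fmt.Println(fp, lastmodified)
		}
		return rows.Err()
	})
	if err != nil {
		panic(err)
	}
}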

@jordanlewis (Member)

I think I may have tracked down the root cause of this finally.

Here's a log from kvfetcher right before a similar panic:

I180917 22:25:36.235901 9420 sql/sqlbase/kvfetcher.go:306  [n6] Scan /Table/13/1/2018-0{6-22T03:20:05.532895Z/359029819147124737-8-14T11:51:32.262383Z/374135496300593156}, /Table/13/{1/2018-08-29T04:31:55.204231Z/378295796590739458-2}
I180917 22:25:36.460483 9420 sql/sqlbase/kvfetcher.go:306  [n6] Scan /Table/13/1/2018-0{7-19T23:24:46.368931Z/366910788100030465/5/1-8-14T11:51:32.262383Z/374135496300593156}, /Table/13/{1/2018-08-29T04:31:55.204231Z/378295796590739458-2}
panic: Non-nullable column "rangelog:rangeID" with no value! Index scanned was "primary" with the index key columns (timestamp,uniqueID) and the values ('2018-07-19 23:24:46.368931+00:00',366910788100030465)

goroutine 9420 [running]:
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*RowFetcher).finalizeRow(0xc428849490, 0x2f941a0, 0xc42727ce40)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/rowfetcher.go:1261 +0x9d9
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*RowFetcher).NextRow(0xc428849490, 0x2f941a0, 0xc42727ce40, 0x74b576, 0x5ba029e0, 0xc41be8feaa, 0x3c2ec08af8, 0xc428cc1e78, 0xcca136, 0xbee028581be8feaa)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/rowfetcher.go:1013 +0x183
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*rowFetcherWrapper).Next(0xc427843ae0, 0xed33220e0, 0x0, 0xc42673b1e0, 0x7)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/tablereader.go:141 +0x4a

If you look closely at the second Scan line, you can see that the start key of the first component of the scan starts at the 5th column family of the table. This seems like a lead, because the rangeID column that causes the panic is inside the 2nd column family. Normally I think it should be okay to start from the middle of a row's column families if the cause of the second scan was pagination, since the kvfetcher is supposed to keep state, including the last keys from the last fetch.

In this case, though, there's a second span in the span list - I have a feeling this is tripping up some code and causing the kv/rowfetcher to lose track of the keys it saw in the last batch. To soothe my own mind, I verified that the other user who saw this issue saw it before #26553 was merged, which is the change that made MVCCScan return byte batches instead of slices of KeyValue.

@jordanlewis (Member)

Good news, I managed to reproduce this in a logic test.

# LogicTest: 5node-dist-opt

statement ok
CREATE TABLE p1 (
  a INT NOT NULL,
  b INT NOT NULL DEFAULT 0,
  c INT NOT NULL DEFAULT 0,
  d INT NOT NULL DEFAULT 0,
  PRIMARY KEY (a),
  FAMILY (a),
  FAMILY (b),
  FAMILY (c),
  FAMILY (d)
)

statement ok
INSERT INTO p1(a) SELECT * FROM generate_series(1, 1000)

statement ok
SET CLUSTER SETTING kv.range_merge.queue_enabled = false;

statement ok
ALTER TABLE p1 SPLIT AT SELECT i*100 FROM generate_series(1,10) as g(i)

statement ok
ALTER TABLE p1 EXPERIMENTAL_RELOCATE SELECT ARRAY[((i - 1) % 5) + 1], i FROM generate_series(1,10) AS g(i)

statement ok
ALTER TABLE p1 SCATTER

query IIII
SELECT * FROM p1
----
I180917 23:40:48.003068 55 sql/logictest/logic.go:880  SELECT * FROM p1;
I180917 23:40:48.011986 2248 sql/sqlbase/kvfetcher.go:306  [n2] Scan /Table/20/1/5{3-4}
I180917 23:40:48.014098 2290 sql/sqlbase/kvfetcher.go:306  [n2] Scan /Table/53/1{-/100}
I180917 23:40:48.014167 2169 sql/sqlbase/kvfetcher.go:306  [n5] Scan /Table/53/1/{300-400}
I180917 23:40:48.014372 2290 sql/sqlbase/kvfetcher.go:306  [n2] Scan /Table/53/1/1{/1/1-00}
I180917 23:40:48.014415 2270 sql/sqlbase/kvfetcher.go:306  [n1,client=127.0.0.1:60381,user=root] Scan /Table/53/1/{100-300}, /Table/53/{1/400-2}
I180917 23:40:48.014680 2270 sql/sqlbase/kvfetcher.go:306  [n1,client=127.0.0.1:60381,user=root] Scan /Table/53/1/{100/1/1-300}, /Table/53/{1/400-2}
panic: Non-nullable column "p1:b" with no value! Index scanned was "primary" with the index key columns (a) and the values (100)

goroutine 2270 [running]:
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*RowFetcher).finalizeRow(0xc003572490, 0x654e520, 0xc002f4eea0)
        /Users/jordan/repo/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/rowfetcher.go:1255 +0xa22
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*RowFetcher).NextRow(0xc003572490, 0x654e520, 0xc002f4eea0, 0x2, 0x2, 0x1, 0x10001, 0x0, 0x0, 0x0)
        /Users/jordan/repo/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/rowfetcher.go:1007 +0x19a
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*rowFetcherWrapper).Next(0xc0003e35e0, 0x654e520, 0xc002f4eea0, 0xc0049e1560, 0xc00253aea0)
        /Users/jordan/repo/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/tablereader.go:141 +0x4a
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*tableReader).Next(0xc003572000, 0x0, 0x654e520, 0x654e460, 0xc002f4eea0)
        /Users/jordan/repo/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/tablereader.go:242 +0x4c
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.Run(0x654e460, 0xc0020bc600, 0x6557920, 0xc003572000, 0x65313e0, 0xc004086380)
        /Users/jordan/repo/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/base.go:170 +0x35
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*ProcessorBase).Run(0xc003572000, 0x654e460, 0xc0020bc600, 0xc002750888)
        /Users/jordan/repo/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/processors.go:731 +0x96
created by github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).startInternal
        /Users/jordan/repo/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:550 +0x2f2
make: *** [testbaselogic] Error 2

@jordanlewis (Member)

Oh, actually, I also have a code modification to make this panic reproducible: unconditionally setting the limitHint of StartScan in tableReader to 1, to provoke the issue quickly without a lot of data.

diff --git a/pkg/sql/distsqlrun/tablereader.go b/pkg/sql/distsqlrun/tablereader.go
index 56ba193fd6..c113f5fa2c 100644
--- a/pkg/sql/distsqlrun/tablereader.go
+++ b/pkg/sql/distsqlrun/tablereader.go
@@ -229,7 +229,7 @@ func (tr *tableReader) Start(ctx context.Context) context.Context {
        tr.input.Start(fetcherCtx)
        if err := tr.fetcher.StartScan(
                fetcherCtx, tr.flowCtx.txn, tr.spans,
-               true /* limit batches */, tr.limitHint, tr.flowCtx.traceKV,
+               true /* limit batches */, 1, tr.flowCtx.traceKV,
        ); err != nil {
                tr.MoveToDraining(err)
        }

@jordanlewis (Member)

OK, I found the problem - it's a very simple mistake. The RowFetcher mistakenly assumes that a row can't span two KV batches. As soon as a new batch occurs (maybeNewSpan is set to true by the kv fetcher), the RowFetcher sets its internal "row done" state to true. This causes problems when a batch inevitably ends in the middle of a row, which isn't prohibited anywhere.

My main question is why none of our tests noticed this. Perhaps we should be testing with small batch sizes everywhere, and with multiple column families.
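
To make the failure mode concrete, here is a tiny, self-contained Go simulation (not CockroachDB code; kv, fetchRows, and resetAtBatchBoundary are invented names). A row stored across several column families can be split across two KV batches; if the fetcher finalizes its in-progress row at every batch boundary, each half of row 100 surfaces as a "row" missing some families, which is exactly the situation the non-nullable-column check then panics on.

package main

import "fmt"

// kv models a single KV pair: one column family of one row.
type kv struct {
	rowID    int
	familyID int
	value    string
}

// fetchRows decodes rows from batches of KVs. If resetAtBatchBoundary is
// true, it mimics the bug: it treats the end of a batch as the end of the
// current row, losing the families it has already seen.
func fetchRows(batches [][]kv, resetAtBatchBoundary bool) []map[int]string {
	var rows []map[int]string
	cur := map[int]string{}
	curRow := -1
	flush := func() {
		if curRow >= 0 {
			rows = append(rows, cur)
			cur = map[int]string{}
			curRow = -1
		}
	}
	for _, batch := range batches {
		for _, e := range batch {
			if e.rowID != curRow {
				flush()
				curRow = e.rowID
			}
			cur[e.familyID] = e.value
		}
		if resetAtBatchBoundary {
			flush() // bug: row 100 is split across batches and gets finalized early
		}
	}
	flush()
	return rows
}

func main() {
	// Row 100 has families 0..3, but the first batch ends after family 1.
	batches := [][]kv{
		{{100, 0, "a"}, {100, 1, "b"}},
		{{100, 2, "c"}, {100, 3, "d"}, {101, 0, "a"}},
	}
	fmt.Println("buggy:  ", fetchRows(batches, true))  // row 100 appears as two partial rows, each missing families
	fmt.Println("correct:", fetchRows(batches, false)) // row 100 assembled once with all four families
}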

@jordanlewis (Member)

I believe this was introduced by #23474, so it wasn't present in 2.0 at least.

craig bot pushed a commit that referenced this issue Sep 19, 2018
30394: sqlbase: bugfix in rowfetcher with multiple cfs r=jordanlewis a=jordanlewis

A bug (introduced during the 2.1 cycle) caused scans issued via distsql
against tables with multiple column families to return incorrect results
or panic the server if some of the column families were marked as
non-nullable. This bug could only occur in distsql because it only
manifested when a rowfetcher was initialized with multiple input spans,
which only occurs in distsql where a particular node owns 2
non-contiguous portions of the keyspace. In this case, the code would
misinterpret the empty response from the scan request on the second
input span to mean that there was a limit on the last scan and that
we might have entered a new input span, which caused the rowfetcher to
mark the current row as ended and expect a full row starting from the
next key retrieved.

This was not the case, as this could happen in normal circumstances.
When a row had multiple column families, the rowfetcher would fail to
remember the results it had already seen for the earlier column families
in a row if a batch happened to end halfway through the row's column
families, and would return incorrect results and/or panic depending on
the definition of the table.

This patch removes this problem and adds a test for it. The test is a
little complicated but I verified that it does in fact panic before the
patch. It should catch problems like this in the future. We had no
explicit tests of the rowfetcher before with multiple input spans.

Note that this patch also deletes a test case, that tests correct
behavior in the case of seeing duplicate rows from the input stream
during an index join. This test case is contrived - index joins can
never encounter duplicate rows from the input stream - so it's safe to
delete.

Closes #29374.
Partially reverts #23474. The code no longer has protection against the condition that that PR was trying to guard against. But I don't think this is a problem in practice. If it crops up in new code it'll be very easy to notice and fix at that point.

cc @vivekmenezes

Release note: None

30423: ui: fix default jobs page sort r=couchand a=couchand

In #18137 a column was added, but the default sort
(which is specified by index) was not updated.  This
updates it to correctly sort on creation time.

Fixes: #30418
Release note (admin ui change): Fixes an issue where
the default sort on the Jobs page was incorrectly
the User column rather than the Creation Time.

Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>
Co-authored-by: Andrew Couch <andrew@cockroachlabs.com>
craig bot closed this as completed in #30394 Sep 19, 2018