panic: Non-nullable column with no value! #29374

Closed
leomkkwan opened this issue Aug 30, 2018 · 31 comments · Fixed by #30394

leomkkwan commented Aug 30, 2018

Running v2.1.0-beta.20180827 on a 5-node cluster. One of the nodes crashed with this panic message.

The column screenview:lastmodified was defined with a default value:

lastmodified TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now()

panic: Non-nullable column "screenview:lastmodified" with no value! Index scanned was "primary" with the index key columns (fp) and the values ('a6b301a0-1e78-4263-b351-e1347c99c2fa')

goroutine 160961941 [running]:
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*RowFetcher).finalizeRow(0xc42687fc90, 0x2efbe40, 0xc43a67a0c0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/rowfetcher.go:1276 +0x9d9
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*RowFetcher).NextRow(0xc42687fc90, 0x2efbe40, 0xc43a67a0c0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/rowfetcher.go:1028 +0x183
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*rowFetcherWrapper).Next(0xc430304ca0, 0xc42b1efb30, 0x0, 0x0, 0xc42b1efb30)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/tablereader.go:141 +0x4a
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*tableReader).Next(0xc42687f800, 0x0, 0x0, 0x0, 0xc42b1efb30)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/tablereader.go:238 +0x4e
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.Run(0x2efbe40, 0xc446d8aba0, 0x2f06300, 0xc42687f800, 0x2eda000, 0xc432698000)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/base.go:170 +0x35
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*ProcessorBase).Run(0xc42687f800, 0x2efbe40, 0xc446d8aba0, 0xc42be2ff38)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/processors.go:739 +0x98
created by github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).StartAsync
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:574 +0x1e3
jordanlewis added the C-bug, O-community, and S-1-stability labels Aug 30, 2018
@jordanlewis (Member)

@leomkkwan did you add that column as part of a schema change, or was it there from the first CREATE TABLE statement?

leomkkwan (Author) commented Aug 30, 2018

I have had that column since the first CREATE TABLE statement; however, when the panic occurred, I was adding an index to another table in the same database. Not sure if that is related.

That ALTER TABLE job got killed by the system and a rollback task was created automatically, but that rollback seems to be stuck. I will create another ticket if the rollback job is really stuck.

@vivekmenezes (Contributor)

@leomkkwan thanks for reporting this problem. Can you share with us the output of SHOW JOBS?
How did you get data into the table? Did you use IMPORT or any other such bulk import tool? Thanks!

@leomkkwan (Author)

SHOW JOBS output:

374446927381528578 | SCHEMA CHANGE | CREATE INDEX ON filu.public.t (aid) STORING (vid) | leo | canceled | 2018-08-15 14:15:33.591238+00:00 | 2018-08-15 14:15:33.70979+00:00 | 2018-08-28 14:21:32.679021+00:00 | 2018-08-28 14:21:32.679022+00:00 | 0.0788539797067642 | | NULL

374470972645703682 | SCHEMA CHANGE | CREATE INDEX ON filu.public.t (aid) STORING (vid) | leo | canceled | 2018-08-15 16:17:51.627163+00:00 | NULL | 2018-08-15 17:27:22.350865+00:00 | 2018-08-17 08:48:47.430118+00:00 | 0 | | NULL

374834006255828996 | SCHEMA CHANGE | ROLL BACK CREATE INDEX ON filu.public.t (aid) STORING (vid) | leo | succeeded | 2018-08-16 23:04:20.698192+00:00 | 2018-08-16 23:04:20.722213+00:00 | 2018-08-17 08:45:12.546743+00:00 | 2018-08-17 08:45:12.546744+00:00 | 1 | | NULL

374948913002020868 | SCHEMA CHANGE | ROLL BACK CREATE INDEX ON filu.public.t (aid) STORING (vid) | leo | succeeded | 2018-08-17 08:48:47.454237+00:00 | 2018-08-17 08:48:47.478727+00:00 | 2018-08-17 08:53:34.756627+00:00 | 2018-08-17 08:53:34.756628+00:00 | 1 | | NULL

378131188709654529 | SCHEMA CHANGE | CREATE INDEX ON filu.public.t (aid) STORING (vid) | root | failed | 2018-08-28 14:34:40.922791+00:00 | 2018-08-28 14:34:41.020696+00:00 | 2018-08-30 18:14:18.347473+00:00 | 2018-08-30 18:14:18.347474+00:00 | 0.155584752559662 | [n1] communication error: rpc error: code = Canceled desc = context canceled | NULL

378740599575347201 | SCHEMA CHANGE | ROLL BACK CREATE INDEX ON filu.public.t (aid) STORING (vid) | root | succeeded | 2018-08-30 18:14:18.314705+00:00 | 2018-08-30 18:14:18.412797+00:00 | 2018-08-31 12:09:41.857125+00:00 | 2018-08-31 12:09:41.857127+00:00 | 1 | | NULL

The first four rows are from a previous alpha version, which I mentioned in issue #28784.

I am not importing anything. I use CRDB as a log store for a web server, so there are a lot of concurrent single-row INSERT statements against that table. The statement looks like this:

insert into screenview (fp,tp,pp,landingurl,agent,ip,referurl,payload) values (@fp,@tp,@pp,@landingurl,@agent,@ip,@referurl,@payload) RETURNING NOTHING;


On a side note, the panic happened only once; however, a random node will go OOM and get killed if I run the above test.

vivekmenezes (Contributor) commented Aug 31, 2018

The CREATE INDEX schema change failure you saw has been fixed through #28769.

However, what you are seeing is a panic, which shouldn't be happening.

So the evidence suggests that this has nothing to do with IMPORT or schema changes, because the schema changes involved were CREATE INDEX and were not modifying any columns.

Thanks for reporting this issue; it requires further investigation. I'll continue looking into it.

@vivekmenezes (Contributor)

@leomkkwan can you run SELECT * FROM screenview WHERE fp = 'a6b301a0-1e78-4263-b351-e1347c99c2fa'? That will help establish whether the stored value is indeed NULL. Thanks!

leomkkwan (Author) commented Sep 4, 2018

Here is the output

fp tp pp date lastmodified payload

a6b301a0-1e78-4263-b351-e1347c99c2fa	a6b301a0-1e78-4263-b351-e1347c99c2fa	a6b301a0-1e78-4263-b351-e1347c99c2fa	2018-08-30 14:11:21	2018-08-30 14:11:21	{"param1": "1", "param2": "2"}

The row was inserted successfully.

@vivekmenezes (Contributor)

@leomkkwan this seems to indicate that the INSERT succeeded correctly and some SELECT request crashed with the above panic. Do you know what request this might be? Usually the request that panicked appears in the logs. Thanks!

@leomkkwan (Author)

@vivekmenezes May I email you the log file so you can take a look?

@vivekmenezes (Contributor)

My email is vivek@cockroachlabs.com. Please do not send me any sensitive information. Thanks!

@leomkkwan (Author)

Hi Vivek, I've sent you the log file; it doesn't contain any sensitive information.
Please take a look and see whether those "retrying adding checkpoint" messages from my CREATE INDEX process relate to this issue.
Thanks.

@vivekmenezes (Contributor)

I180830 18:13:33.371469 537 server/status/runtime.go:433  [n4] runtime stats: 10 GiB RSS, 360 goroutines, 238 MiB/154 MiB/562 MiB GO alloc/idle/total, 7.9 GiB/9.9 GiB CGO alloc/total, 199311.67cgo/sec, 1.04/0.05 %(u/s)time, 0.00 %gc (3x)
I180830 18:13:34.451296 539 server/status/recorder.go:652  [n4,summaries] available memory from cgroups (8.0 EiB) exceeds system memory 30 GiB, using system memory
I180830 18:13:36.358093 160915042 sql/distsqlrun/backfiller.go:226  [n4,IndexBackfiller=90] retrying adding checkpoint /Table/90/1/190{744971/16064/0-863091/1491} to table t_demographic_survey
I180830 18:13:43.374066 537 server/status/runtime.go:433  [n4] runtime stats: 10 GiB RSS, 345 goroutines, 217 MiB/176 MiB/576 MiB GO alloc/idle/total, 7.9 GiB/9.9 GiB CGO alloc/total, 66491.83cgo/sec, 0.73/0.06 %(u/s)time, 0.00 %gc (2x)
I180830 18:13:44.451265 539 server/status/recorder.go:652  [n4,summaries] available memory from cgroups (8.0 EiB) exceeds system memory 30 GiB, using system memory
I180830 18:13:53.372411 537 server/status/runtime.go:433  [n4] runtime stats: 10 GiB RSS, 293 goroutines, 214 MiB/181 MiB/577 MiB GO alloc/idle/total, 7.9 GiB/9.9 GiB CGO alloc/total, 174182.79cgo/sec, 0.92/0.05 %(u/s)time, 0.00 %gc (2x)
I180830 18:13:54.451300 539 server/status/recorder.go:652  [n4,summaries] available memory from cgroups (8.0 EiB) exceeds system memory 30 GiB, using system memory
I180830 18:13:57.312758 160933402 sql/distsqlrun/backfiller.go:226  [n4,IndexBackfiller=90] retrying adding checkpoint /Table/90/1/190{749311/14696/0-863091/1491} to table t_demographic_survey
I180830 18:14:03.372848 537 server/status/runtime.go:433  [n4] runtime stats: 10 GiB RSS, 273 goroutines, 183 MiB/201 MiB/580 MiB GO alloc/idle/total, 7.9 GiB/9.9 GiB CGO alloc/total, 2959.47cgo/sec, 0.60/0.06 %(u/s)time, 0.00 %gc (2x)
I180830 18:14:04.451217 539 server/status/recorder.go:652  [n4,summaries] available memory from cgroups (8.0 EiB) exceeds system memory 30 GiB, using system memory
I180830 18:14:07.663524 160942899 sql/distsqlrun/backfiller.go:226  [n4,IndexBackfiller=90] retrying adding checkpoint /Table/90/1/190{751541/14676/0-863091/1491} to table t_demographic_survey
I180830 18:14:07.779057 160942899 sql/distsqlrun/backfiller.go:226  [n4,IndexBackfiller=90] retrying adding checkpoint /Table/90/1/190{751541/14676/0-863091/1491} to table t_demographic_survey
I180830 18:14:13.372046 537 server/status/runtime.go:433  [n4] runtime stats: 10 GiB RSS, 262 goroutines, 331 MiB/120 MiB/580 MiB GO alloc/idle/total, 7.9 GiB/9.9 GiB CGO alloc/total, 90983.63cgo/sec, 0.73/0.06 %(u/s)time, 0.00 %gc (1x)
I180830 18:14:14.451285 539 server/status/recorder.go:652  [n4,summaries] available memory from cgroups (8.0 EiB) exceeds system memory 30 GiB, using system memory
panic: Non-nullable column "screenview:lastmidifed" with no value! Index scanned was "primary" with the index key columns (fp) and the values ('a6b301a0-1e78-4263-b351-e1347c99c2fa')

goroutine 160961941 [running]:
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*RowFetcher).finalizeRow(0xc42687fc90, 0x2efbe40, 0xc43a67a0c0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/rowfetcher.go:1276 +0x9d9
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*RowFetcher).NextRow(0xc42687fc90, 0x2efbe40, 0xc43a67a0c0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/rowfetcher.go:1028 +0x183
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*rowFetcherWrapper).Next(0xc430304ca0, 0xc42b1efb30, 0x0, 0x0, 0xc42b1efb30)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/tablereader.go:141 +0x4a
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*tableReader).Next(0xc42687f800, 0x0, 0x0, 0x0, 0xc42b1efb30)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/tablereader.go:238 +0x4e
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.Run(0x2efbe40, 0xc446d8aba0, 0x2f06300, 0xc42687f800, 0x2eda000, 0xc432698000)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/base.go:170 +0x35
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*ProcessorBase).Run(0xc42687f800, 0x2efbe40, 0xc446d8aba0, 0xc42be2ff38)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/processors.go:739 +0x98
created by github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).StartAsync
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:574 +0x1e3

I'm now betting that this crash is happening in the index backfiller. The index backfiller runs through a scan of the table, and it hits this panic. Normally, panics caused by statements result in the log showing an anonymized version of the statement.

@jordanlewis (Member)

I don't think it's the index backfiller. You'd see that in the stack trace.

The reason that there's no statement printed is that this is a flow on a remote node. You can tell because the bottom of the stack trace is from StartAsync.

leomkkwan (Author) commented Sep 9, 2018

The server has been upgraded to v2.1.0-beta.20180904.

I can replicate this panic by running the following query while the concurrent inserts are happening:

select * from universe.screenview where pp= '129d5825-f5b9-4535-bcad-5aaf2ad85bba'

The value '129d5825-f5b9-4535-bcad-5aaf2ad85bba' exists, and the panic message is for another row that is currently being inserted.

panic: Non-nullable column "screenview:lastmodified" with no value! Index scanned was "primary" with the index key columns (fp) and the values ('563e10ba-1a4c-4416-81aa-dcf9dcbb6662')

@vivekmenezes (Contributor)

@leomkkwan the INSERT should not be running through the code path that is hitting this panic, nor would the SELECT. Are there any other queries being run on this system? Perhaps the Statements page in the admin UI would provide answers? Thanks for reporting this issue.

@jordanlewis (Member)

Vivek, I do think the SELECT would run through this code path. I can show you more on Monday.

@leomkkwan (Author)

There were only two tasks running when this happened: INSERTs from an npgsql C# client and a SELECT statement from a DBeaver client.

@vivekmenezes (Contributor)

@leomkkwan can you share the schema with us using SHOW CREATE TABLE screenview? I'd like to reproduce this on our end.

You can confirm that select * from universe.screenview where fp = '563e10ba-1a4c-4416-81aa-dcf9dcbb6662' is returning a lastmodified value right?

It would also be nice to see the output of EXPLAIN select * from universe.screenview where pp= '129d5825-f5b9-4535-bcad-5aaf2ad85bba'

Thanks!

@leomkkwan (Author)

CREATE TABLE screenview (
    fp UUID NOT NULL DEFAULT gen_random_uuid(),
    tp UUID NOT NULL,
    pp UUID NOT NULL,
    nfp UUID NULL,
    pfp UUID NULL,
    date TIMESTAMPTZ NOT NULL DEFAULT now(),
    lastmodified TIMESTAMPTZ NOT NULL DEFAULT now(),
    ip INET NULL,
    landingurl STRING NULL,
    referurl STRING NULL,
    agent STRING NULL,
    sa0 STRING NULL,
    sa1 STRING NULL,
    sa2 STRING NULL,
    sa3 STRING NULL,
    sa4 STRING NULL,
    sa5 STRING NULL,
    sa6 STRING NULL,
    sa7 STRING NULL,
    sa8 STRING NULL,
    sa9 STRING NULL,
    payload JSONB NULL,
    CONSTRAINT "primary" PRIMARY KEY (fp ASC),
    FAMILY "primary" (fp, tp, pp, date, ip, landingurl, referurl, agent, sa0, sa1, sa2, sa3, sa4, sa5, sa6, sa7, sa8, sa9, payload),
    FAMILY f1 (nfp, pfp, lastmodified)
)

Confirmed: select * from universe.screenview where fp = '563e10ba-1a4c-4416-81aa-dcf9dcbb6662' returns a lastmodified value.

EXPLAIN select * from universe.screenview where pp= '129d5825-f5b9-4535-bcad-5aaf2ad85bba'

scan
  table  screenview@primary
  spans  ALL

A side note: the above query filters on the pp field, which is not the primary key.

@solongordon (Contributor)

Actually, it looks like this is simply a case of TableReader not draining when it encounters an error from the RowFetcher. This causes it to keep calling Next on the RowFetcher, which is now in a bad state, generating errors in a tight loop. I think this got broken when stats were added to the tablereader early in this release cycle.

I'll fix this and also do an audit for other places where processors might not be draining when a downstream error is encountered.
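
For context, here is a minimal, self-contained Go sketch of the drain-on-error idea. The types (row, metadata, fetcher, tableReader) are invented stand-ins, not CockroachDB's actual RowFetcher/ProcessorBase APIs; the point is only that once the source returns an error, the processor should switch to a draining state rather than keep calling the source in a loop.

package main

import (
	"errors"
	"fmt"
)

// row and metadata are stand-ins for distsql's EncDatumRow and ProducerMetadata.
type row []int
type metadata struct{ err error }

// fetcher is a stand-in for RowFetcher: it yields one row, then fails, and
// once it has failed it keeps returning errors on every subsequent call.
type fetcher struct{ calls int }

func (f *fetcher) nextRow() (row, error) {
	f.calls++
	if f.calls == 1 {
		return row{1}, nil
	}
	return nil, errors.New("row fetcher is in a bad state")
}

// tableReader models a processor. The draining flag is the fix: without it,
// next would call nextRow again after the first error and emit error
// metadata forever, in a tight loop.
type tableReader struct {
	f        fetcher
	draining bool
}

func (tr *tableReader) next() (row, *metadata) {
	if tr.draining {
		return nil, nil // drained: emit nothing further
	}
	r, err := tr.f.nextRow()
	if err != nil {
		tr.draining = true // analogous to moving to the draining state on error
		return nil, &metadata{err: err}
	}
	return r, nil
}

func main() {
	tr := &tableReader{}
	for {
		r, meta := tr.next()
		switch {
		case meta != nil:
			fmt.Println("error metadata:", meta.err)
		case r != nil:
			fmt.Println("row:", r)
		default:
			fmt.Println("done")
			return
		}
	}
}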

@jordanlewis (Member)

Hmm, it occurs to me that I don't directly understand how the missing drain call could cause this panic - do you? If you call NextRow after this error condition, can you see how it could hit the panic?

solongordon added a commit to solongordon/cockroach that referenced this issue Sep 12, 2018
I found several places where processors were not checking for error
metadata from their inputs and draining appropriately. The most
egregious of these is in TableReader, which was not calling
MoveToDraining when it encountered an error from the underlying
RowFetcher. This meant that rather than draining it would continue
calling NextRow and generating errors in a tight loop, which caused the
query to hang as well as high CPU and memory usage on the affected node.

Other affected processors are Distinct, SortChunks, and SortTopK.

Fixes cockroachdb#29374
Fixes cockroachdb#29978

Release note: None
craig bot pushed a commit that referenced this issue Sep 12, 2018
30096: distsql: add missing MoveToDraining calls r=solongordon a=solongordon

I found several places where processors were not checking for error
metadata from their inputs and draining appropriately. The most
egregious of these is in TableReader, which was not calling
MoveToDraining when it encountered an error from the underlying
RowFetcher. This meant that rather than draining it would continue
calling NextRow and generating errors in a tight loop, which caused the
query to hang as well as high CPU and memory usage on the affected node.

Other affected processors are Distinct, SortChunks, and SortTopK.

Fixes #29374
Fixes #29978

Release note: None

Co-authored-by: Solon Gordon <solon@cockroachlabs.com>
craig bot closed this as completed in #30096 Sep 12, 2018
@solongordon (Contributor)

@jordanlewis Yeah, I have not had any luck reproducing the panic. It seems possible that continuing to call NextRow after an error could trigger the panic. Something similar happened in my repro, where initially the ReadWithinUncertaintyIntervalError was thrown but a different error (single entry value with no default column id) was thrown on subsequent NextRow calls.

@leomkkwan, would you be willing to retest with my fix to see if the panic persists?

solongordon reopened this Sep 12, 2018
@leomkkwan (Author)

@solongordon Do you have a Linux binary that includes your fix so I can test?

@solongordon (Contributor)

Yes, you can get a bleeding edge binary via https://edge-binaries.cockroachdb.com/cockroach/cockroach.linux-gnu-amd64.LATEST

@leomkkwan (Author)

I'm testing with the latest binary. The server did not panic, but the first few tries returned the following error. After about a minute, everything looks good.

I will keep testing this, but I just wanted to let you know what I found.

SQL Error [40001]: ERROR: restart transaction: HandledRetryableTxnError: ReadWithinUncertaintyIntervalError: read at time 1536854492.069197097,0 encountered previous write with future timestamp 1536854492.148524676,0 within uncertainty interval t <= 1536854492.387521394,0; observed timestamps: [{2 1536854492.069197097,0} {3 1536854493.137278554,0} {4 1536854492.387521394,0} {5 1536854492.070576462,0}]
ERROR: restart transaction: HandledRetryableTxnError: ReadWithinUncertaintyIntervalError: read at time 1536854492.069197097,0 encountered previous write with future timestamp 1536854492.148524676,0 within uncertainty interval t <= 1536854492.387521394,0; observed timestamps: [{2 1536854492.069197097,0} {3 1536854493.137278554,0} {4 1536854492.387521394,0} {5 1536854492.070576462,0}]
ERROR: restart transaction: HandledRetryableTxnError: ReadWithinUncertaintyIntervalError: read at time 1536854492.069197097,0 encountered previous write with future timestamp 1536854492.148524676,0 within uncertainty interval t <= 1536854492.387521394,0; observed timestamps: [{2 1536854492.069197097,0} {3 1536854493.137278554,0} {4 1536854492.387521394,0} {5 1536854492.070576462,0}]
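
These 40001 "restart transaction" errors are retryable by design, so the client is expected to retry the transaction. As a rough sketch only (assuming a Go client using database/sql with the lib/pq driver; the connection string, column choices, and string-based error check are illustrative, not taken from this thread's C# or DBeaver clients):

package main

import (
	"database/sql"
	"fmt"
	"strings"
	"time"

	_ "github.com/lib/pq" // hypothetical choice of Postgres-wire driver
)

// runWithRetry retries fn while the server reports a retryable
// "restart transaction" error (SQLSTATE 40001).
func runWithRetry(db *sql.DB, fn func(*sql.Tx) error) error {
	for {
		tx, err := db.Begin()
		if err != nil {
			return err
		}
		err = fn(tx)
		if err == nil {
			if err = tx.Commit(); err == nil {
				return nil
			}
		}
		_ = tx.Rollback()
		// Crude check; a real client would inspect the SQLSTATE code instead.
		if strings.Contains(err.Error(), "restart transaction") {
			continue
		}
		return err
	}
}

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/universe?sslmode=disable") // illustrative DSN
	if err != nil {
		panic(err)
	}
	defer db.Close()

	err = runWithRetry(db, func(tx *sql.Tx) error {
		rows, err := tx.Query(
			`SELECT fp, lastmodified FROM screenview WHERE pp = $1`,
			"129d5825-f5b9-4535-bcad-5aaf2ad85bba")
		if err != nil {
			return err
		}
		defer rows.Close()
		for rows.Next() {
			var fp string
			var lastmodified time.Time
			if err := rows.Scan(&fp, &lastmodified); err != nil {
				return err
			}
			fmt.Println(fp, lastmodified)
		}
		return rows.Err()
	})
	if err != nil {
		panic(err)
	}
}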

@jordanlewis (Member)

I think I may have tracked down the root cause of this finally.

Here's a log from kvfetcher right before a similar panic:

I180917 22:25:36.235901 9420 sql/sqlbase/kvfetcher.go:306  [n6] Scan /Table/13/1/2018-0{6-22T03:20:05.532895Z/359029819147124737-8-14T11:51:32.262383Z/374135496300593156}, /Table/13/{1/2018-08-29T04:31:55.204231Z/378295796590739458-2}
I180917 22:25:36.460483 9420 sql/sqlbase/kvfetcher.go:306  [n6] Scan /Table/13/1/2018-0{7-19T23:24:46.368931Z/366910788100030465/5/1-8-14T11:51:32.262383Z/374135496300593156}, /Table/13/{1/2018-08-29T04:31:55.204231Z/378295796590739458-2}
panic: Non-nullable column "rangelog:rangeID" with no value! Index scanned was "primary" with the index key columns (timestamp,uniqueID) and the values ('2018-07-19 23:24:46.368931+00:00',366910788100030465)

goroutine 9420 [running]:
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*RowFetcher).finalizeRow(0xc428849490, 0x2f941a0, 0xc42727ce40)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/rowfetcher.go:1261 +0x9d9
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*RowFetcher).NextRow(0xc428849490, 0x2f941a0, 0xc42727ce40, 0x74b576, 0x5ba029e0, 0xc41be8feaa, 0x3c2ec08af8, 0xc428cc1e78, 0xcca136, 0xbee028581be8feaa)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/rowfetcher.go:1013 +0x183
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*rowFetcherWrapper).Next(0xc427843ae0, 0xed33220e0, 0x0, 0xc42673b1e0, 0x7)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/tablereader.go:141 +0x4a

If you look closely at the second Scan line, you can see that the start key of the first component of the scan starts at the 5th column family of the table. This seems like a lead, because the rangeID column that causes the panic is inside the 2nd column family. Normally I think it should be okay to start from the middle of a row's column families if the cause of the second scan was pagination, since the kvfetcher is supposed to keep state, including the last keys from the last fetch.

In this case, though, there's a second span in the span list - I have a feeling this is tripping up some code and causing the kv/rowfetcher to lose track of the keys it saw in the last batch. To soothe my own mind, I verified that the other user who saw this issue saw it before #26553 was merged, which is the change that made MVCCScan return byte batches instead of slices of KeyValue.

@jordanlewis (Member)

Good news, I managed to reproduce this in a logic test.

# LogicTest: 5node-dist-opt

statement ok
CREATE TABLE p1 (
  a INT NOT NULL,
  b INT NOT NULL DEFAULT 0,
  c INT NOT NULL DEFAULT 0,
  d INT NOT NULL DEFAULT 0,
  PRIMARY KEY (a),
  FAMILY (a),
  FAMILY (b),
  FAMILY (c),
  FAMILY (d)
)

statement ok
INSERT INTO p1(a) SELECT * FROM generate_series(1, 1000)

statement ok
SET CLUSTER SETTING kv.range_merge.queue_enabled = false;

statement ok
ALTER TABLE p1 SPLIT AT SELECT i*100 FROM generate_series(1,10) as g(i)

statement ok
ALTER TABLE p1 EXPERIMENTAL_RELOCATE SELECT ARRAY[((i - 1) % 5) + 1], i FROM generate_series(1,10) AS g(i)

statement ok
ALTER TABLE p1 SCATTER

query IIII
SELECT * FROM p1
----
I180917 23:40:48.003068 55 sql/logictest/logic.go:880  SELECT * FROM p1;
I180917 23:40:48.011986 2248 sql/sqlbase/kvfetcher.go:306  [n2] Scan /Table/20/1/5{3-4}
I180917 23:40:48.014098 2290 sql/sqlbase/kvfetcher.go:306  [n2] Scan /Table/53/1{-/100}
I180917 23:40:48.014167 2169 sql/sqlbase/kvfetcher.go:306  [n5] Scan /Table/53/1/{300-400}
I180917 23:40:48.014372 2290 sql/sqlbase/kvfetcher.go:306  [n2] Scan /Table/53/1/1{/1/1-00}
I180917 23:40:48.014415 2270 sql/sqlbase/kvfetcher.go:306  [n1,client=127.0.0.1:60381,user=root] Scan /Table/53/1/{100-300}, /Table/53/{1/400-2}
I180917 23:40:48.014680 2270 sql/sqlbase/kvfetcher.go:306  [n1,client=127.0.0.1:60381,user=root] Scan /Table/53/1/{100/1/1-300}, /Table/53/{1/400-2}
panic: Non-nullable column "p1:b" with no value! Index scanned was "primary" with the index key columns (a) and the values (100)

goroutine 2270 [running]:
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*RowFetcher).finalizeRow(0xc003572490, 0x654e520, 0xc002f4eea0)
        /Users/jordan/repo/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/rowfetcher.go:1255 +0xa22
github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*RowFetcher).NextRow(0xc003572490, 0x654e520, 0xc002f4eea0, 0x2, 0x2, 0x1, 0x10001, 0x0, 0x0, 0x0)
        /Users/jordan/repo/src/github.com/cockroachdb/cockroach/pkg/sql/sqlbase/rowfetcher.go:1007 +0x19a
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*rowFetcherWrapper).Next(0xc0003e35e0, 0x654e520, 0xc002f4eea0, 0xc0049e1560, 0xc00253aea0)
        /Users/jordan/repo/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/tablereader.go:141 +0x4a
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*tableReader).Next(0xc003572000, 0x0, 0x654e520, 0x654e460, 0xc002f4eea0)
        /Users/jordan/repo/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/tablereader.go:242 +0x4c
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.Run(0x654e460, 0xc0020bc600, 0x6557920, 0xc003572000, 0x65313e0, 0xc004086380)
        /Users/jordan/repo/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/base.go:170 +0x35
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*ProcessorBase).Run(0xc003572000, 0x654e460, 0xc0020bc600, 0xc002750888)
        /Users/jordan/repo/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/processors.go:731 +0x96
created by github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).startInternal
        /Users/jordan/repo/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:550 +0x2f2
make: *** [testbaselogic] Error 2

@jordanlewis (Member)

Oh, actually, I also have a code modification to make this panic reproducible: unconditionally setting the limitHint of StartScan in tableReader to 1, to provoke the issue quickly without a lot of data.

diff --git a/pkg/sql/distsqlrun/tablereader.go b/pkg/sql/distsqlrun/tablereader.go
index 56ba193fd6..c113f5fa2c 100644
--- a/pkg/sql/distsqlrun/tablereader.go
+++ b/pkg/sql/distsqlrun/tablereader.go
@@ -229,7 +229,7 @@ func (tr *tableReader) Start(ctx context.Context) context.Context {
        tr.input.Start(fetcherCtx)
        if err := tr.fetcher.StartScan(
                fetcherCtx, tr.flowCtx.txn, tr.spans,
-               true /* limit batches */, tr.limitHint, tr.flowCtx.traceKV,
+               true /* limit batches */, 1, tr.flowCtx.traceKV,
        ); err != nil {
                tr.MoveToDraining(err)
        }

@jordanlewis (Member)

OK, I found the problem - it's a very simple mistake. The RowFetcher mistakenly assumes that a row can't span two KV batches. As soon as a new batch occurs (maybeNewSpan is set to true by the kv fetcher), the RowFetcher sets its internal "row done" state to true. This causes problems when a batch inevitably ends in the middle of a row, which isn't prohibited anywhere.

My main question is why none of our tests noticed this. Perhaps we should be testing with small batch sizes everywhere, and with multiple column families.
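
To make the failure mode concrete, here is a tiny, self-contained Go simulation (not CockroachDB code; kv, fetchRows, and resetAtBatchBoundary are invented names). A row stored across several column families can be split across two KV batches; if the fetcher finalizes its in-progress row at every batch boundary, each half of row 100 surfaces as a "row" missing some families, which is exactly the situation the non-nullable-column check then panics on.

package main

import "fmt"

// kv models a single KV pair: one column family of one row.
type kv struct {
	rowID    int
	familyID int
	value    string
}

// fetchRows decodes rows from batches of KVs. If resetAtBatchBoundary is
// true, it mimics the bug: it treats the end of a batch as the end of the
// current row, losing the families it has already seen.
func fetchRows(batches [][]kv, resetAtBatchBoundary bool) []map[int]string {
	var rows []map[int]string
	cur := map[int]string{}
	curRow := -1
	flush := func() {
		if curRow >= 0 {
			rows = append(rows, cur)
			cur = map[int]string{}
			curRow = -1
		}
	}
	for _, batch := range batches {
		for _, e := range batch {
			if e.rowID != curRow {
				flush()
				curRow = e.rowID
			}
			cur[e.familyID] = e.value
		}
		if resetAtBatchBoundary {
			flush() // bug: row 100 is split across batches and gets finalized early
		}
	}
	flush()
	return rows
}

func main() {
	// Row 100 has families 0..3, but the first batch ends after family 1.
	batches := [][]kv{
		{{100, 0, "a"}, {100, 1, "b"}},
		{{100, 2, "c"}, {100, 3, "d"}, {101, 0, "a"}},
	}
	fmt.Println("buggy:  ", fetchRows(batches, true))  // row 100 appears as two partial rows, each missing families
	fmt.Println("correct:", fetchRows(batches, false)) // row 100 assembled once with all four families
}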

@jordanlewis (Member)

I believe this was introduced by #23474, so it wasn't present in 2.0 at least.

craig bot pushed a commit that referenced this issue Sep 19, 2018
30394: sqlbase: bugfix in rowfetcher with multiple cfs r=jordanlewis a=jordanlewis

A bug (introduced during the 2.1 cycle) caused scans issued via distsql
against tables with multiple column families to return incorrect results
or panic the server if some of the column families were marked as
non-nullable. This bug could only occur in distsql because it only
manifested when a rowfetcher was initialized with multiple input spans,
which only occurs in distsql where a particular node owns 2
non-contiguous portions of the keyspace. In this case, the code would
misinterpret the empty response from the scan request on the second
input span to mean that there was a limit on the last scan and that
we might have entered a new input span, which caused the rowfetcher to
mark the current row as ended and expect a full row starting from the
next key retrieved.

This was not the case, as this could happen in normal circumstances.
When a row had multiple column families, the rowfetcher would fail to
remember the results it had already seen for the earlier column families
in a row if a batch happened to end halfway through the row's column
families, and would return incorrect results and/or panic depending on
the definition of the table.

This patch removes this problem and adds a test for it. The test is a
little complicated but I verified that it does in fact panic before the
patch. It should catch problems like this in the future. We had no
explicit tests of the rowfetcher before with multiple input spans.

Note that this patch also deletes a test case, that tests correct
behavior in the case of seeing duplicate rows from the input stream
during an index join. This test case is contrived - index joins can
never encounter duplicate rows from the input stream - so it's safe to
delete.

Closes #29374.
Partially reverts #23474. The code no longer has protection against the condition that that PR was trying to guard against. But I don't think this is a problem in practice. If it crops up in new code it'll be very easy to notice and fix at that point.

cc @vivekmenezes

Release note: None

30423: ui: fix default jobs page sort r=couchand a=couchand

In #18137 a column was added, but the default sort
(which is specified by index) was not updated.  This
updates it to correctly sort on creation time.

Fixes: #30418
Release note (admin ui change): Fixes an issue where
the default sort on the Jobs page was incorrectly
the User column rather than the Creation Time.

Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>
Co-authored-by: Andrew Couch <andrew@cockroachlabs.com>
craig bot closed this as completed in #30394 Sep 19, 2018