panic: Non-nullable column with no value! #29374
@leomkkwan did you add that column as part of a schema change, or was it there from the first CREATE TABLE?
I have had that column since the first CREATE TABLE statement; however, when the panic happened, I was adding an index to another table in the same database. Not sure if that is related. That ALTER TABLE job got killed by the system and a rollback task was created automatically, and it seems to me that the rollback is stuck. I will create another ticket if that rollback job is really stuck.
@leomkkwan thanks for reporting this problem. Can you share with us the output of SHOW JOBS?
Show jobs output:

```
374446927381528578 | SCHEMA CHANGE | CREATE INDEX ON filu.public.t (aid) STORING (vid) | leo | canceled | 2018-08-15 14:15:33.591238+00:00 | 2018-08-15 14:15:33.70979+00:00 | 2018-08-28 14:21:32.679021+00:00 | 2018-08-28 14:21:32.679022+00:00 | 0.0788539797067642 | | NULL
374470972645703682 | SCHEMA CHANGE | CREATE INDEX ON filu.public.t (aid) STORING (vid) | leo | canceled | 2018-08-15 16:17:51.627163+00:00 | NULL | 2018-08-15 17:27:22.350865+00:00 | 2018-08-17 08:48:47.430118+00:00 | 0 | | NULL
374834006255828996 | SCHEMA CHANGE | ROLL BACK CREATE INDEX ON filu.public.t (aid) STORING (vid) | leo | succeeded | 2018-08-16 23:04:20.698192+00:00 | 2018-08-16 23:04:20.722213+00:00 | 2018-08-17 08:45:12.546743+00:00 | 2018-08-17 08:45:12.546744+00:00 | 1 | | NULL
374948913002020868 | SCHEMA CHANGE | ROLL BACK CREATE INDEX ON filu.public.t (aid) STORING (vid) | leo | succeeded | 2018-08-17 08:48:47.454237+00:00 | 2018-08-17 08:48:47.478727+00:00 | 2018-08-17 08:53:34.756627+00:00 | 2018-08-17 08:53:34.756628+00:00 | 1 | | NULL
378131188709654529 | SCHEMA CHANGE | CREATE INDEX ON filu.public.t (aid) STORING (vid) | root | failed | 2018-08-28 14:34:40.922791+00:00 | 2018-08-28 14:34:41.020696+00:00 | 2018-08-30 18:14:18.347473+00:00 | 2018-08-30 18:14:18.347474+00:00 | 0.155584752559662 | [n1] communication error: rpc error: code = Canceled desc = context canceled | NULL
378740599575347201 | SCHEMA CHANGE | ROLL BACK CREATE INDEX ON filu.public.t (aid) STORING (vid) | root | succeeded | 2018-08-30 18:14:18.314705+00:00 | 2018-08-30 18:14:18.412797+00:00 | 2018-08-31 12:09:41.857125+00:00 | 2018-08-31 12:09:41.857127+00:00 | 1 | | NULL
```

The first 4 lines were from a previous alpha version, which I mentioned in issue #28784. I am not importing anything. What I did is use CRDB as a log for a web server, so there were a lot of concurrent single-row INSERT statements to that table.
The statement is like this:

```sql
insert into screenview (fp,tp,pp,landingurl,agent,ip,referurl,payload) values (@fp,@tp,@pp,@landingurl,@agent,@ip,@referurl,@payload) RETURNING NOTHING;
```

On a side note, the "panic" issue happened only once; however, a random node will go OOM and get killed if I run the above test.
The CREATE INDEX schema change failure you saw has been fixed by #28769. However, what you are seeing is a panic, which shouldn't be happening. The evidence suggests that this has nothing to do with IMPORT or schema changes, because the schema changes involved were CREATE INDEX and were not modifying any columns. Thanks for reporting this issue; it requires further investigation. I'll continue investigating this.
@leomkkwan can you run
Here is the output:

```
fp tp pp date lastmodified payload
```

The row has been successfully inserted.
@leomkkwan this seems to indicate that the INSERT succeeded correctly and some SELECT request crashed with the above panic. Do you know what request this might be? Usually the request that panicked is in the logs. Thanks
@vivekmenezes May I email you the log file so you can take a look?
my email is vivek@cockroachlabs.com. Please do not send me any sensitive information. Thanks
Hi Vivek, I've sent you the log file and it doesn't contain any sensitive information.
I'm now betting that this crash is happening in the index backfiller. The index backfiller runs through a scan of the table, and it hits this panic. Normally, panics caused by statements result in an anonymized version of the statement being logged.
I don't think it's the index backfiller. You'd see that in the stack trace. The reason that there's no statement printed is that this is a flow on a remote node. You can tell because the bottom of the stack trace is from
The server is upgraded to v2.1.0-beta.20180904. I can replicate this panic by running the following query while the concurrent inserts are happening:

```sql
select * from universe.screenview where pp = '129d5825-f5b9-4535-bcad-5aaf2ad85bba'
```

where '129d5825-f5b9-4535-bcad-5aaf2ad85bba' exists, and the panic message will be for another currently inserting row:

```
panic: Non-nullable column "screenview:lastmodified" with no value! Index scanned was "primary" with the index key columns (fp) and the values ('563e10ba-1a4c-4416-81aa-dcf9dcbb6662')
```
@leomkkwan the INSERT should not be running through the code path that is hitting this panic. Nor would the SELECT. Are there any other queries being run on this system? Perhaps the Statements page in the admin UI would provide you with answers? Thanks for reporting this issue.
Vivek, I do think the SELECT would run through this code path. I can show you more on Monday.
There were only two tasks running when this happened: INSERTs from an npgsql C# client and the SELECT statement from a DBeaver client.
@leomkkwan can you share the schema with us using SHOW CREATE TABLE screenview? I'd like to reproduce this on our end. You can confirm that the SELECT by fp is returning a lastmodified value, right? It would also be nice to see the output of EXPLAIN on the failing query. Thanks!
```
CREATE TABLE screenview (
```

> You can confirm that select * from universe.screenview where fp = '563e10ba-1a4c-4416-81aa-dcf9dcbb6662' is returning a lastmodified value right?

Confirmed.

```sql
EXPLAIN select * from universe.screenview where pp = '129d5825-f5b9-4535-bcad-5aaf2ad85bba'
```
Actually it looks like this is simply a case of a missing drain call. I'll fix this and also do an audit for other places where processors might not be draining when a downstream error is encountered.
Hmm, it occurs to me that I don't directly understand how the missing drain call could cause this panic - do you? If you call
I found several places where processors were not checking for error metadata from their inputs and draining appropriately. The most egregious of these is in TableReader, which was not calling MoveToDraining when it encountered an error from the underlying RowFetcher. This meant that rather than draining, it would continue calling NextRow and generating errors in a tight loop, which caused the query to hang as well as high CPU and memory usage on the affected node. Other affected processors are Distinct, SortChunks, and SortTopK.

Fixes cockroachdb#29374
Fixes cockroachdb#29978

Release note: None
30096: distsql: add missing MoveToDraining calls r=solongordon a=solongordon

Fixes #29374 Fixes #29978

Release note: None

Co-authored-by: Solon Gordon <solon@cockroachlabs.com>
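The draining pattern that fix describes can be sketched with a toy Go model (purely illustrative; the types and the MoveToDraining analog here are hypothetical stand-ins, not CockroachDB's actual distsql API). The point is that on the first error metadata, the processor stops pulling from its input instead of re-reading the same error in a tight loop:

```go
package main

import (
	"errors"
	"fmt"
)

// meta is a stand-in for distsql producer metadata that may carry an error.
type meta struct{ err error }

// faultyInput yields one row and then returns the same error on every
// subsequent call, mimicking a RowFetcher that keeps failing.
type faultyInput struct{ calls int }

func (f *faultyInput) next() (rowOK bool, m *meta) {
	f.calls++
	if f.calls == 1 {
		return true, nil
	}
	return false, &meta{err: errors.New("fetch failed")}
}

// run mimics the fixed processor: on the first error metadata it "moves
// to draining" (stops pulling rows) instead of looping on the error.
func run(in *faultyInput) (rows int, err error) {
	for {
		ok, m := in.next()
		if m != nil && m.err != nil {
			// MoveToDraining analog: record the error and stop asking for rows.
			return rows, m.err
		}
		if !ok {
			return rows, nil
		}
		rows++
	}
}

func main() {
	in := &faultyInput{}
	rows, err := run(in)
	fmt.Println(rows, err, in.calls) // 1 fetch failed 2
}
```

Without the early return, each iteration would produce the same error again, which matches the hang and high CPU described in the PR.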
@jordanlewis Yeah, I have not had any luck reproducing the panic. It seems possible that continuing to call NextRow is related.

@leomkkwan, would you be willing to retest with my fix to see if the panic persists?
@solongordon Do you have a Linux binary which includes your fix so I can test?
Yes, you can get a bleeding edge binary via https://edge-binaries.cockroachdb.com/cockroach/cockroach.linux-gnu-amd64.LATEST
I'm testing with the latest binary. The server did not panic, but the first few tries returned the following error. After about a minute, everything looks good. I will keep testing this, but just wanted to let you know what I found.

```
SQL Error [40001]: ERROR: restart transaction: HandledRetryableTxnError: ReadWithinUncertaintyIntervalError: read at time 1536854492.069197097,0 encountered previous write with future timestamp 1536854492.148524676,0 within uncertainty interval
```
I think I may have finally tracked down the root cause of this. Here's a log from
If you look closely at the second In this case, though, there's a second span in the span list - I have a feeling this is tripping up some code and causing the kv/rowfetcher to lose track of the keys it saw in the last batch. To soothe my own mind, I verified that the other user who saw this issue saw it before #26553 was merged, which is the change that made MVCCScan return byte batches instead of slices of KeyValue.
Good news, I managed to reproduce this in a logic test.
Oh, actually, I also have a code modification to enable this to panic: unconditionally setting the
OK, I found the problem - it's a very simple mistake.

My main question is why none of our tests noticed this. Perhaps we should be testing with small batch sizes everywhere, and with multiple column families.
I believe this was introduced by #23474, so it wasn't present in 2.0 at least. |
30394: sqlbase: bugfix in rowfetcher with multiple cfs r=jordanlewis a=jordanlewis

A bug (introduced during the 2.1 cycle) caused scans issued via distsql against tables with multiple column families to return incorrect results or panic the server if some of the column families were marked as non-nullable. This bug could only occur in distsql, because it only manifested when a rowfetcher was initialized with multiple input spans, which only occurs in distsql when a particular node owns 2 non-contiguous portions of the keyspace.

In this case, the code would misinterpret the empty response from the scan request on the second input span to mean that there was a limit on the last scan and that we might have entered a new input span, which caused the rowfetcher to mark the current row as ended and expect a full row starting from the next key retrieved. This was not the case, as this situation can arise in normal circumstances. When a row had multiple column families, the rowfetcher would fail to remember the results it had already seen for the earlier column families in a row if a batch happened to end halfway through the row's column families, and would return incorrect results and/or panic depending on the definition of the table.

This patch removes this problem and adds a test for it. The test is a little complicated, but I verified that it does in fact panic before the patch. It should catch problems like this in the future. We had no explicit tests of the rowfetcher before with multiple input spans.

Note that this patch also deletes a test case that tests correct behavior in the case of seeing duplicate rows from the input stream during an index join. This test case is contrived - index joins can never encounter duplicate rows from the input stream - so it's safe to delete.

Closes #29374. Partially reverts #23474. The code no longer has protection against the condition that that PR was trying to guard against, but I don't think this is a problem in practice.
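The failure mode described in the PR can be illustrated with a toy Go model (purely illustrative, not the real RowFetcher or CockroachDB's encoding): if partial-row state is reset at a batch boundary, a row whose column families straddle that boundary loses its earlier families, which is exactly how a NOT NULL column like lastmodified could appear to have no value.

```go
package main

import "fmt"

// kv is a toy key/value: rowKey identifies the row, val names one
// column family's contents. (Illustrative only.)
type kv struct {
	rowKey string
	val    string
}

// familiesPerRow stitches kvs from successive batches into rows and
// counts the column families seen per row. If resetAtBatchBoundary is
// true (modeling the bug), partial-row state accumulated so far is
// discarded whenever a new batch begins.
func familiesPerRow(batches [][]kv, resetAtBatchBoundary bool) map[string]int {
	rows := map[string]int{}
	for _, batch := range batches {
		if resetAtBatchBoundary {
			rows = map[string]int{} // bug: forget families already seen
		}
		for _, e := range batch {
			rows[e.rowKey]++
		}
	}
	return rows
}

func main() {
	// Row "a" has two column families, and the batch ends between them.
	batches := [][]kv{
		{{"a", "fp"}},           // batch 1 ends mid-row
		{{"a", "lastmodified"}}, // second family arrives in batch 2
	}
	fmt.Println(familiesPerRow(batches, false)["a"]) // 2: fixed behavior
	fmt.Println(familiesPerRow(batches, true)["a"])  // 1: buggy, family lost
}
```

In the buggy case the assembled row is missing the lastmodified family entirely, matching the reported "Non-nullable column with no value!" panic.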
If it crops up in new code it'll be very easy to notice and fix at that point. cc @vivekmenezes

Release note: None

Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>
Running v2.1.0-beta.20180827 on a 5-node cluster. One of the nodes crashed with this panic message.

The column screenview:lastmodified was defined with a default value:

```sql
lastmodified TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now()
```