Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intermittent failure with Data Explorer smoke tests #4663

Open
dfalbel opened this issue Sep 12, 2024 · 3 comments
Open

Intermittent failure with Data Explorer smoke tests #4663

dfalbel opened this issue Sep 12, 2024 · 3 comments
Assignees
Labels
area: data explorer Issues related to Data Explorer category. bug Something isn't working lang: python testing Unit, extension-level, and smoke tests

Comments

@dfalbel
Copy link
Contributor

dfalbel commented Sep 12, 2024

System details:

Currently only reproducible on CI with commits after merging #4326.

Positron and OS details:

Interpreter details:

Python/Polars

Describe the issue:

Since merging #4326 we're seeing intermittend failures with the data explorer smoke tests. Most of the times, it happened while openining a very small Polars data frame. Resulting in:

playwright-screenshot-1-Python_Polars_-_Verifies_basic_data_explorer_functionality__C644538___pr

From the console logs we see:

ERR Invalid response from 46ffec56-ce4b-4b0c-94bb-8b2faf1b530e: no 'result' field. (response = {"jsonrpc":"2.0","method":"return_column_profiles","params":{"callback_id":"db018346-19ba-4d40-a718-ec395fa5fac5","profiles":[{"null_count":0,"summary_stats":null,"small_histogram":null,"large_histogram":null,"small_frequency_table":null,"large_frequency_table":null},{"null_count":0,"summary_stats":null,"small_histogram":null,"large_histogram":null,"small_frequency_table":null,"large_frequency_table":null},{"null_count":0,"summary_stats":null,"small_histogram":null,"large_histogram":null,"small_frequency_table":null,"large_frequency_table":null},{"null_count":1,"summary_stats":null,"small_histogram":null,"large_histogram":null,"small_frequency_table":null,"large_frequency_table":null},{"null_count":1,"summary_stats":null,"small_histogram":null,"large_histogram":null,"small_frequency_table":null,"large_frequency_table":null},{"null_count":1,"summary_stats":null,"small_histogram":null,"large_histogram":null,"small_frequency_table":null,"large_frequency_table":null}]}})
ERR get_column_profiles timed out after 10 seconds: Error: get_column_profiles timed out after 10 seconds
    at vscode-file://vscode-app/home/runner/work/positron/positron/out/vs/workbench/services/languageRuntime/common/languageRuntimeDataExplorerClient.js:294:35

This suggests that the return_column_profiles event was intepreted as an RPC response, thus causing 1. the get_column_profiles RPC call to fail. 2. The results are not treated as an event, and thus, they are not correctly forwarded to the renderer.

This could happen when the return_column_profiles message is sent before we actually sent the reply to get_column_profiles RPC. We added a mechanism in the Python code to prevent this to happen in 14ae1ad, and while it seems to prevent messages from being sent out of order. However, unless we're missing something, my hypothesis is that there's no guarantee that Positron will process them in the correct order.

The root cause of this issue is that when the ipykernel sends a comm.send_event(), it actually adds the parent headers of the last request - which is technically wrong - events should not have parents message ids. It's possible to change this behavior, but requires a good amount of work.

So AFAICT what's happening is that the Jupyter adapter forwards the event message to the RPC handler:

if (this._pendingRpcs.has(msg.originId)) {

And causes the RPC error that's unexpected by the data explorer.

Steps to reproduce the issue:

Only seen on CI so far. Se attached artifacts.

run-artifacts-13.zip

Expected or desired behavior:

No failures, data should be displayed

Were there any error messages in the UI, Output panel, or Developer Tools console?

See above

@dfalbel dfalbel added bug Something isn't working area: data explorer Issues related to Data Explorer category. lang: python labels Sep 12, 2024
@juliasilge juliasilge added the testing Unit, extension-level, and smoke tests label Sep 12, 2024
midleman added a commit that referenced this issue Sep 12, 2024
### Intent
This test has recently become flakey due to
[this](#4663) known issue.
In the mean time, we are disabling it to prevent blocking of pipelines.

### QA Notes
N/A
@wesm
Copy link
Contributor

wesm commented Sep 12, 2024

A possible fix is that on the Positron side we let through messages that don't have the result field, is there a consistent way to detect whether it's a frontend method being called with send_event?

@dfalbel
Copy link
Contributor Author

dfalbel commented Sep 12, 2024

AFAICT, everything that is not an RPC reply is treated as an event, ie, there's no strong definition of what an 'event' is, see eg:

if (message.parent_id && this._pendingRpcs.has(message.parent_id)) {
// This is a response to an RPC call; resolve the deferred promise.
const promise = this._pendingRpcs.get(message.parent_id);
promise?.complete(message);
this._pendingRpcs.delete(message.parent_id);
} else {
// This is a regular message; emit it to the client as an event.
this._dataEmitter.fire({ data: message.data as Output, buffers: message.buffers });
}
}

if (this._pendingRpcs.has(msg.originId)) {
const promise = this._pendingRpcs.get(msg.originId)!;
if (promise.settled) {
// If the promise is already settled in some way (resolved or
// rejected) we can't do anything with it, so just log a
// warning.
console.warn(`Ignoring RPC response for ${msg.originId}; ` +
`RPC already settled (timed out?)`);
} else {
// Otherwise, resolve the promise with the message data.
promise.resolve({ data: message.data, buffers: msg.buffers });
}
// Remove the RPC from the pending list
this._pendingRpcs.delete(msg.originId);
} else {
// Not an RPC response, so emit the message
this._messageEmitter.fire({ data: message.data, buffers: msg.buffers });
}

@wesm
Copy link
Contributor

wesm commented Sep 12, 2024

if (message.parent_id && this._pendingRpcs.has(message.parent_id)) {
			// This is a response to an RPC call; resolve the deferred promise.

Here I think we would want to add a check that the message contains a result field, to know whether it is actually a response to an RPC as opposed to an event that was sent while the RPC response was pending

@juliasilge juliasilge added this to the 2024.10.0 Pre-Release milestone Sep 16, 2024
jmcphers added a commit that referenced this issue Sep 18, 2024
Speculative change to address
#4663.

As noted in that issue, this is mostly a workaround for an issue that
exists in the Python kernel wherein comm messages emitted always have
the parent ID of the most recent message to the shell (even if they are
not a reply to that message). This can cause problems in Positron since
it assumes that, after a message is sent, the next message with a
matching `parent_id` is the reply to that message.

The fix is to check a message for `result` or `error` keys before
treating it as an RPC response. We can't do this for _all_ messages
since many codepaths/comms do not use the JSON-RPC structure, but we do
it for all formally defined Positron comms.

This is admittedly a little janky. Some things that could make this
better (but are higher risk than this change):
- a more robust way of identifying Positron's JSON-RPC comm messages
- having Positron supply its own ID with each request that must be
returned in the reply body in order to deterministically pair requests
and replies (i.e. don't rely on `parent_id` since it can be hard to
control)
- moving Positron comms off the shell socket (@lionel- has suggested
e.g. moving them to the control socket to allow them to operate
independently of the current shell command)

### QA Notes

This change is small but it's a hot codepath -- almost every comm
message goes through here. Sanity test the data explorer, and
ipywidgets. They use this codepath but specifically do not want the new
behavior added here.
midleman pushed a commit that referenced this issue Sep 19, 2024
Speculative change to address
#4663.

As noted in that issue, this is mostly a workaround for an issue that
exists in the Python kernel wherein comm messages emitted always have
the parent ID of the most recent message to the shell (even if they are
not a reply to that message). This can cause problems in Positron since
it assumes that, after a message is sent, the next message with a
matching `parent_id` is the reply to that message.

The fix is to check a message for `result` or `error` keys before
treating it as an RPC response. We can't do this for _all_ messages
since many codepaths/comms do not use the JSON-RPC structure, but we do
it for all formally defined Positron comms.

This is admittedly a little janky. Some things that could make this
better (but are higher risk than this change):
- a more robust way of identifying Positron's JSON-RPC comm messages
- having Positron supply its own ID with each request that must be
returned in the reply body in order to deterministically pair requests
and replies (i.e. don't rely on `parent_id` since it can be hard to
control)
- moving Positron comms off the shell socket (@lionel- has suggested
e.g. moving them to the control socket to allow them to operate
independently of the current shell command)

### QA Notes

This change is small but it's a hot codepath -- almost every comm
message goes through here. Sanity test the data explorer, and
ipywidgets. They use this codepath but specifically do not want the new
behavior added here.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: data explorer Issues related to Data Explorer category. bug Something isn't working lang: python testing Unit, extension-level, and smoke tests
Projects
None yet
Development

No branches or pull requests

4 participants