[query] Avoid py4j for python-backend interactions #13797

daniel-goldstein · 2023-10-11T21:05:53Z

CHANGELOG: Fixes #13756: operations that collect large results such as to_pandas may require up to 3x less memory.

This turns all "actions", i.e. the backend methods supported by QoB, into HTTP endpoints on the spark and local backends. It intentionally avoids py4j, which was designed to pass function names and references around and does not handle large payloads well (such as the results of a collect). Specifically, py4j uses a text-based protocol on top of TCP that substantially inflates the memory required to communicate large byte arrays. On the Java side, py4j serializes every binary payload as a Base64-encoded java.lang.String. Base64 inflates the data by 4/3 and String's UTF-16 representation doubles it again, so the String occupies 4/3 * 2 = 8/3, or nearly three times, the size of the byte array on either side of the py4j pipe. py4j also appears to make a full copy of this payload, which means nearly a 6x memory requirement for sending back bytes. Using our own socket means we can send the response bytes directly back to python without any of this overhead, even going so far as to encode results directly into the TCP output stream. Formalizing the API between python and java also allows us to reuse the same payload schema across all three backends.
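The 8/3 figure can be checked with a few lines of Python. This is only a sketch of the arithmetic, not of py4j itself: it assumes, as the description does, that the Base64 text is held at two bytes per character, and the helper name `utf16_base64_footprint` is made up for illustration.

```python
import base64


def utf16_base64_footprint(payload: bytes) -> int:
    """Bytes needed to hold `payload` as Base64 text stored as UTF-16."""
    b64_text = base64.b64encode(payload).decode("ascii")
    # UTF-16 spends two bytes on every ASCII character.
    # "utf-16-le" is used so that no byte-order mark is added.
    return len(b64_text.encode("utf-16-le"))


payload = bytes(3_000_000)  # stand-in for a collected result
string_bytes = utf16_base64_footprint(payload)
ratio = string_bytes / len(payload)

print(f"raw payload:    {len(payload):,} bytes")
print(f"as UTF-16 text: {string_bytes:,} bytes")
print(f"inflation:      {ratio:.3f}x")
```

A second full copy of that string, as described above, would bring the total to roughly 16/3 of the original size, which is the "nearly 6x" figure.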

danking · 2023-10-16T16:34:56Z

I can start looking at this as soon as you give me the OK

daniel-goldstein · 2023-10-17T14:21:09Z

@danking Looks like I'm still failing to configure a couple of settings related to references on the ServiceBackend, but feel free to start looking. You'll notice that I made quite a substantial refactor in ServiceBackend.scala in an attempt to harmonize the scala backends a bit more. The rationale is that I was having a hard time working with the various thunks passed around there. I saw them as a bit of a poor-man's-object way to capture some state from the input file while keeping the ServiceBackend stateless. IMO there's no harm in making the ServiceBackend just as stateful as the other backends since it is single use. So I lifted a lot of that state into backend-creation time and drew a harder line between which part of the input configures the backend and which part is for the action being performed. This made it easier to reuse a couple of methods like tableType and such.

I'm happy to take suggestions on ways to trim down this PR, but I thought you'd want to take a look at the whole thing given the time-sensitivity.

danking · 2023-10-17T16:16:37Z

No worries, I'll review this as is!

danking

A few thoughts; I'll have to do another, closer read, but this looks awesome so far. I'm really glad we moved away from my silly binary protocol to a JSON-based one.

hail/python/hail/backend/backend.py

danking · 2023-10-18T22:25:34Z

hail/python/hail/backend/spark_backend.py

@@ -268,7 +273,18 @@ def hail_package(self):
    def utils_package_object(self):
        return self._utils_package_object

+    def _rpc(self, action, payload) -> Tuple[bytes, str]:


This can live on Py4JBackend, right?

I see that you depend on _backend_server. You could pass the _jbackend into Py4JBackend's __init__ and then you can construct _backend_server in Py4JBackend.

Good call. I took it several steps further and also lifted a bunch of other duplication into Py4JBackend. Should be limited to just commit aa3ffcc if you want to look at that separately.

danking · 2023-10-18T22:28:32Z

hail/python/hail/backend/py4j_backend.py

+from .backend import ActionTag, Backend, fatal_error_from_java_error_triplet
+
+import http.client
+http.client._MAXLINE = 2 ** 20


what's this about?

I left a comment explaining, and here's an SO post. I'm not really sure what to do here. It felt reasonable to send back timings in a response header instead of smushing it into the response body, as it is metadata, but I was left with a decision of how to raise the max header size and I went with something overly generous. AFAIK there is no actual limit in HTTP.

I was going to say I was worried about proxies but I remember this entire connection is controlled by us. Yeah this seems fine.
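For readers unfamiliar with the limit being discussed, here is a minimal, self-contained sketch using only the standard library. It assumes a made-up header name (`X-Timings`) and a toy handler; neither is taken from this PR. It shows `http.client` rejecting a response header line longer than its default limit and accepting it once `_MAXLINE` is raised.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

TIMINGS_HEADER = "X-Timings"   # hypothetical header name
BIG_TIMINGS = "t" * 100_000    # longer than http.client's default line limit


class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Drain the request body, then answer with metadata in a header.
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        body = b"result-bytes"
        self.send_response(200)
        self.send_header(TIMINGS_HEADER, BIG_TIMINGS)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def rpc(port: int, route: str, payload: bytes):
    """POST `payload` and return (response body, timings header value)."""
    conn = http.client.HTTPConnection("127.0.0.1", port)
    try:
        conn.request("POST", route, body=payload)
        resp = conn.getresponse()
        return resp.read(), resp.getheader(TIMINGS_HEADER)
    finally:
        conn.close()


server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick a free port
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

default_limit = http.client._MAXLINE
try:
    rpc(port, "/execute", b"{}")
    rejected = False
except http.client.LineTooLong:
    rejected = True  # the oversized header line is refused by default

http.client._MAXLINE = 2 ** 20  # same override as in the diff above
body, timings = rpc(port, "/execute", b"{}")

http.client._MAXLINE = default_limit
server.shutdown()
server.server_close()
```

Note that `_MAXLINE` is a private module-level attribute, so the override applies to every `http.client` user in the process. The server may log a broken-connection traceback for the first, rejected request.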

danking · 2023-10-18T22:31:39Z

hail/python/hail/backend/service_backend.py

-                result_bytes = await retry_transient_errors(self._read_output, ir, iodir + '/out', iodir + '/in')
-                return result_bytes, timings
+                result_bytes = await retry_transient_errors(self._read_output, iodir + '/out', iodir + '/in')
+                return result_bytes, str(timings.to_dict())


why the str(...to_dict())?

The local/spark backends return str for their timings, so I was adhering to that signature. Admittedly, I cannot find anywhere in the codebase where timings is actually used, so maybe I should have instead gone the other direction and instantiated a Timings from the java JSON. I can add a test to that effect if you have an opinion on one or the other, or should we just scrap the timings altogether?

Hmm. Let's not touch in this PR. I don't really recall how timings works.

danking · 2023-10-18T22:32:57Z

hail/python/hail/backend/service_backend.py

+
+    async def _async_rpc(self, action: ActionTag, payload: ActionPayload):
+        if action == ActionTag.EXECUTE:
+            assert isinstance(payload, ExecutePayload)


This kind of pattern makes me wonder if the tag should just hang off of the payload e.g. .tag?

Ya I kind of agree, except on the server I switch on the tag to determine how I should deserialize the payload, so it didn't make sense to me for that to be part of the payload. Also, the local/spark backends don't use the tag, they use a URL route, so it wouldn't be helpful to also send the tag. Open to alternative suggestions.

I suppose an alternative that side-steps this whole thing could be: move the IR functions and idempotency token out of the execute payload and make them fields of the ServiceBackend config. Then we don't need to downcast the payload but can still choose only to send IR functions on ActionTag.EXECUTE.

Hmm. I'll leave your suggestion up to you. I'm OK with this all as it is and don't have my head deep enough in this to have strong opinions about whether IR functions and tokens are in execute or outside.

I think I'll leave as is now and we can revisit in a separate PR
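The trade-off in this thread can be sketched as follows. This is an illustration only: `ActionTag`, `ActionPayload` and `ExecutePayload` are names from the diff, but the enum values, the payload fields, the second payload class and the routes are all assumptions made for the example. The tag stays outside the payload so that the receiver can pick a deserializer before it looks at the body, and the local/spark path can map the same tag to a URL route instead of sending it.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Dict, Type, Union


class ActionTag(Enum):
    TABLE_TYPE = 1  # values are illustrative
    EXECUTE = 2


@dataclass
class TableTypePayload:  # hypothetical payload
    ir: str


@dataclass
class ExecutePayload:  # fields are assumptions
    ir: str
    timed: bool


ActionPayload = Union[TableTypePayload, ExecutePayload]

# The tag alone determines how the body is deserialized.
PAYLOAD_TYPES: Dict[ActionTag, Type] = {
    ActionTag.TABLE_TYPE: TableTypePayload,
    ActionTag.EXECUTE: ExecutePayload,
}

# Hypothetical routes: an HTTP backend can use these instead of sending the tag.
ROUTES: Dict[ActionTag, str] = {
    ActionTag.TABLE_TYPE: "/type/table",
    ActionTag.EXECUTE: "/execute",
}


def encode(payload: ActionPayload) -> bytes:
    """Serialize a payload to JSON bytes; the tag travels separately."""
    return json.dumps(asdict(payload)).encode("utf-8")


def decode(tag: ActionTag, raw: bytes) -> ActionPayload:
    """Switch on the tag to choose the payload class, then parse the body."""
    return PAYLOAD_TYPES[tag](**json.loads(raw))


sent = ExecutePayload(ir="(I32 5)", timed=True)
received = decode(ActionTag.EXECUTE, encode(sent))
```

With this layout the `isinstance` assertion in `_async_rpc` is still needed on the sending side, which is the awkwardness the review comment points at.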

hail/src/main/scala/is/hail/backend/Backend.scala

hail/src/main/scala/is/hail/backend/spark/SparkBackend.scala

danking · 2023-10-18T22:57:28Z

hail/src/main/scala/is/hail/backend/service/ServiceBackend.scala

+  return_type: String,
+  rendered_body: String,
+)
+
 class ServiceBackendSocketAPI2(


We should give this a better name now. It's not a socket api at all, it's just an API I guess?

Renamed to ServiceBackendAPI. I could be convinced to not have this class at all and just put the main method on the object ServiceBackend

addressed

danking

minor things

hail/python/hail/backend/local_backend.py

hail/python/hail/ir/ir.py

hail/src/main/scala/is/hail/backend/service/ServiceBackend.scala

hail/src/main/scala/is/hail/HailContext.scala

danking · 2023-10-20T19:47:24Z

hail/src/main/scala/is/hail/backend/service/Worker.scala

@@ -158,11 +158,11 @@ object Worker {
    timer.start("executeFunction")

    if (HailContext.isInitialized) {
-      HailContext.get.backend = new ServiceBackend(null, null, new HailClassLoader(getClass().getClassLoader()), null, None)
+      HailContext.get.backend = new ServiceBackend(null, null, new HailClassLoader(getClass().getClassLoader()), null, None, null, null, null, null)


This is feeling increasingly wrong. I think after this PR lands I'll PR something that creates a QoBWorkerBackend which is just the class loader and raises UnsupportedOperationException for everything else.

that would be great

addressed

danking · 2023-10-20T20:35:12Z

hail/python/hail/backend/spark_backend.py

@@ -146,13 +146,10 @@ def validate_file(self, uri: str) -> None:
        validate_file(uri, self._router_async_fs)

    def stop(self):
+        super().stop()
        self._jbackend.close()


stopping the hail context calls this, so you can just remove it.

But it's idempotent so you don't need to. I'll approve but let's nix it.

daniel-goldstein force-pushed the no-py4j branch 4 times, most recently from baf8613 to b911fce October 12, 2023 21:24

danking mentioned this pull request Oct 13, 2023

[query] Don't use py4j for Backend operations #13756

Closed

[query] Avoid py4j for python-backend interactions

dfeca5a

daniel-goldstein force-pushed the no-py4j branch from edd8f8a to dfeca5a October 13, 2023 21:54

danking self-assigned this Oct 16, 2023

daniel-goldstein added 5 commits October 16, 2023 17:01

refactor service backend

06dc270

actually listen to timed parameter

c98c837

increase requests header length to accommodate very large timings json

15535bf

move where we add default references

2747109

more cleanup

761d8d0

delete a couple unused methods

b853d05

daniel-goldstein marked this pull request as ready for review October 17, 2023 14:26

fix

b267c9a

daniel-goldstein added 3 commits October 17, 2023 16:37

make the ServiceBackend return the element not the tuple just like the other backends

a9c8f01

some more fixes

530604d

facepalm emoji

0413c75

danking previously requested changes Oct 18, 2023

daniel-goldstein added 6 commits October 18, 2023 19:04

fix

73c905b

eliminate a lot of redundancy between local_backend and spark_backend

aa3ffcc

move scala server into its own file

b0d3ee1

use integer ids for persisted IR

39638c5

rename socket api to just api

ecd1bbb

update irMap in java tests

4f0b9b7

daniel-goldstein linked an issue Oct 20, 2023 that may be closed by this pull request

[qob] Since 0.2.117 vds.filter_intervals unnecessarily uses a lot of RAM #13748

Closed

danking previously requested changes Oct 20, 2023

cleanup

0dfad45

danking approved these changes Oct 20, 2023

remove redundant close of jbackend in spark_backend

86ed9c8

danking merged commit c73386f into hail-is:main Oct 20, 2023
8 checks passed
