
Run list perf optimizations #149

Closed
yebrahim opened this issue Nov 8, 2018 · 16 comments

@yebrahim
Contributor

yebrahim commented Nov 8, 2018

This is a quick analysis of the experiment details page performance. Most of the time is spent making multiple consecutive requests to load all the information we need. We currently do this:

  1. Call the getExperiment API to get the details (name, description, etc.).
  2. Call the listJobs API to get all recurring jobs in this experiment.
  3. Call the listRuns API to show the first page of runs in this experiment.
  4. For each run (in parallel), call its getRun API to get its details (name, status, duration, etc.).
  5. For each run (in parallel), call getPipeline on its pipeline ID in order to show the pipeline name.
  6. For each run (in parallel), call getExperiment on its experiment ID, if any, to show the experiment name. This is not needed when listing runs of a given experiment, but it's technical debt we accumulated, since we use the same component to list runs everywhere.

Some low-hanging perf improvements can be obtained by doing the following:
  • There is no need to do the first three steps in sequence; they're not interdependent.
  • There is no need to do the last three steps in sequence; they all use the same metadata fields.
  • We can render the list of runs as we get them, before we get the details of each run and its pipeline. This will show their names and statuses, but not their run times or pipeline names. Subsequent requests then re-render to fill out these fields.
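The steps above can be sketched in the UI code. This is a minimal TypeScript sketch, assuming hypothetical client functions passed in as parameters (`getExperiment`, `listJobs`, `listRuns`, `getRunDetails`, `render` are illustrative stand-ins, not the actual API client signatures):

```typescript
type Run = { id: string; name: string; status: string; duration?: string };

async function loadExperimentPage(
  getExperiment: () => Promise<{ name: string }>,
  listJobs: () => Promise<string[]>,
  listRuns: () => Promise<Run[]>,
  getRunDetails: (id: string) => Promise<Run>,
  render: (runs: Run[]) => void,
): Promise<Run[]> {
  // Steps 1-3 are not interdependent, so issue them concurrently
  // instead of one after the other.
  const [, , runs] = await Promise.all([getExperiment(), listJobs(), listRuns()]);

  // Render immediately with names and statuses; durations and
  // pipeline names fill in after the detail requests complete.
  render(runs);

  // Steps 4-6 run per run, all in parallel; re-render once details arrive.
  const detailed = await Promise.all(runs.map((r) => getRunDetails(r.id)));
  render(detailed);
  return detailed;
}
```

The key change is that the page paints twice: once with the cheap list data, and again when the per-run details land.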
@vicaire
Contributor

vicaire commented Nov 8, 2018

@yebrahim, the bottleneck is most likely the number of simultaneous queries made to the API server and the underlying DB.

I suggest the following improvements:

  1. Implement a backend API that returns all the needed job data (first page) with a single DB query, and another that returns all the needed run data (first page) with a single DB query. We could call these APIs ListJobDetails and ListRunDetails.

  2. Increase the number of threads that can handle requests concurrently in the backend (depending on what it currently is).

  3. Increase the number of backend servers that can handle requests. (We probably don't want to rely on this, though, as running additional servers is costly for the user.)

@yebrahim
Contributor Author

yebrahim commented Nov 9, 2018

I can't get request service time from the API server; @IronPan can maybe comment on this.
I think you're underestimating network latency, though. Compared to how long it takes a request to reach the API server through the k8s network, the time it takes the API server to serve requests, even hundreds of parallel ones, should be negligible.

In any case, here's a waterfall of requests vs render time:

[screenshot: waterfall of requests vs. render time]
This maps nicely to the 6 items I listed above.

The average request time shown here is about 40ms, and I think all of that is network latency. I might be wrong, of course; there's no way for me to tell without some profiling on the API server.

@yebrahim
Contributor Author

yebrahim commented Nov 9, 2018

One more thing: for each request, we actually make two network hops instead of one, since we use the frontend webserver as a proxy.
There are two reasons behind this design:

  1. Our backend doesn't allow CORS.
  2. The webserver handles more work than just being a proxy: it makes k8s master API calls on behalf of the client, reads artifacts from GCS and Minio, reports health and build info for both backend and frontend, and houses the Tensorboard proxy.

@vicaire
Contributor

vicaire commented Nov 9, 2018

Does the frontend reuse the same connection pool to make requests? Or does it start a new connection for each request? If the latter, that could be the cause of the latency.

@yebrahim
Contributor Author

yebrahim commented Nov 9, 2018

What's a connection pool?

@vicaire
Contributor

vicaire commented Nov 9, 2018

Here is a link:

https://en.wikipedia.org/wiki/Connection_pool

The link discusses database connections but it also applies to service connections.

To summarize, the UI should only establish the connection to the service a few times and keep reusing the same connections (that's the pool of connections). Re-creating new connections every time would lead to a significant increase in latency.

@IronPan
Member

IronPan commented Jan 31, 2019

Maybe we should consider using tools such as https://github.com/prometheus/client_golang to instrument backend perf.

@yebrahim
Contributor Author

yebrahim commented Feb 8, 2019

We're bound by network latency here. Some more numbers:

| Endpoint | From API server pod | From client (browser/curl outside cluster) |
| --- | --- | --- |
| List runs | 20 | 170 |
| List pipelines | 5 | 135 |
| List experiments | 4 | 135 |
| Get run | 11 | 155 |
| Get pipeline | 5 | 135 |
| Get experiment | 3 | 132 |

Numbers are all in milliseconds, and gathered using curl both inside the API server's pod and outside the cluster.

I can see there's potential for improving the server-side performance of the list runs endpoint, but that is still negligible compared to the observed difference between inside and outside the pod, which I can only attribute to network latency.

I'm not sure there's a way to improve the latency here; I'll look into this next. But it seems to me the low-hanging fruit is not blocking on all requests before rendering the UI. For the list runs call, for example, it may be enough to show the data as it comes in rather than waiting to gather all of it.

@vicaire
Contributor

vicaire commented Feb 11, 2019

yebrahim@,

The latency looks OK for a network call (~100ms).

Rather than spending time on rendering the UI progressively, would it be possible to reduce the number of queries that are made to the backend?

This could be done by adding some optimized, paginated queries to the backend.

@yebrahim
Contributor Author

I'd like to look a little more into the network latency before dismissing it, just to make sure we're not missing anything. But regardless, progressive rendering is low-hanging fruit; it should be less work than changing/adding APIs and using them.

However, an API change seems like the best resolution to the problem. Ideally we'd switch to something like GraphQL, but barring that, we can modify the list response body to include the objects it references (namely experiments and pipelines). This should improve list performance by roughly 3x.
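The enriched list response could look something like this. All field names here (`referencedPipelines`, `referencedExperiments`, etc.) are hypothetical, just to illustrate the shape:

```typescript
// Hypothetical enriched list response: referenced objects come back inline,
// keyed by ID, so the UI needs no follow-up getPipeline/getExperiment calls.
interface RunListDetailsResponse {
  runs: {
    id: string;
    name: string;
    status: string;
    pipelineId: string;
    experimentId?: string;
  }[];
  referencedPipelines: Record<string, { id: string; name: string }>;
  referencedExperiments: Record<string, { id: string; name: string }>;
}

// Resolving a run's pipeline name becomes a local lookup
// instead of a network round trip per run.
function pipelineName(resp: RunListDetailsResponse, runIndex: number): string {
  const run = resp.runs[runIndex];
  return resp.referencedPipelines[run.pipelineId]?.name ?? 'unknown';
}
```

Keying the referenced objects by ID also deduplicates them when many runs share the same pipeline or experiment.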

@vicaire
Contributor

vicaire commented Feb 15, 2019

Agree that an API change is the best solution. We can make returning the extra data optional.

Switching to GraphQL would be great. If you would take this on (starting with a design proposal), that would be a fantastic contribution.

I am not clear about the 3x improvement. When does the current performance start to degrade? Depending on the number of rows to render, how many queries are made today? How many queries would be made after your change? Is it possible to change the number of queries to just a couple no matter the number of rows?

@yebrahim
Contributor Author

The 3x figure is actually a conservative estimate, based on the analysis in this comment. We have enough data to render by the time we list runs for a given experiment (or all runs), but we still wait until we get the experiment's jobs, then each run's pipelines, then each run's experiment.

The number of queries doesn't change, it's just that they'd be dispatched in parallel with the UI rendering the data incrementally as it's available.

@vicaire
Contributor

vicaire commented Feb 16, 2019

Did you check how the performance scales? What if there are 10 jobs or runs in the DB? 100? 1000? 10,000? 100,000? 1,000,000? What if the UI is rendering 10, 100, 200?

@yebrahim
Contributor Author

Thanks for the suggestion, that did surface another issue, which can be seen in this screenshot:
[screenshot: network waterfall showing requests queued in batches of six]

These requests were all started in parallel (they all started within ~50ms of each other), but Chrome caps concurrent connections to any given domain at 6 (see here), and it seems most modern browsers do the same. The HTTP/1.1 spec suggests two (see section 8.1.4 here). So the requests are queued up in batches of 6.

FYI: HTTP/2 can multiplex many requests over a single connection.
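The queueing behavior can be modeled with a small concurrency limiter: N tasks through a pool of 6 "connections" complete in ceil(N/6) waves, which matches the batching in the screenshot. A minimal sketch (the worker-pool shape here is illustrative, not how the browser is implemented):

```typescript
// Run tasks with at most `limit` in flight at once, mimicking the
// browser's per-host connection cap.
async function runWithLimit<T>(
  tasks: Array<() => Promise<T>>,
  limit: number,
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;
  // Each worker pulls the next unclaimed task until none remain.
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const i = next++;
      results[i] = await tasks[i]();
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, tasks.length) },
    () => worker(),
  );
  await Promise.all(workers);
  return results;
}
```

With 18 equal-length tasks and a limit of 6, the tasks complete in three waves, just like the batches of 6 in the waterfall above.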

Changing the API seems like our best bet, to either GraphQL, or to just support getting details of multiple objects in the same request.

@derekhh
Contributor

derekhh commented Apr 3, 2019

This might be unrelated, but the pipeline dashboard UI on our GKE cluster was really slow yesterday (we had probably around 100 runs), and I cleaned up the pods used by previous workflows following this comment: #844 (comment)

After that, the speed went back to normal.

@rileyjbauer
Contributor

Closing for now, given the improvements to list queries.

Linchin pushed a commit to Linchin/pipelines that referenced this issue Apr 11, 2023
magdalenakuhn17 pushed a commit to magdalenakuhn17/pipelines that referenced this issue Oct 22, 2023