page_service: don't count time spent flushing towards smgr latency metrics #10042

problame · 2024-12-06T17:28:56Z

Problem

In #9962 I changed the smgr metrics to include time spent on flush.

How long that flush takes isn't under our (= the storage team's) control, because the client can stop reading from the connection at any time.

Summary of changes

Stop the timer as soon as we've buffered up the response in the pgb_writer.

Track flush time in a separate metric.
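To illustrate the two changes above, here is a minimal sketch of splitting one request's latency into an execution part and a flush part. It is not the code from this PR: `LatencyMetrics`, `ExecutionTimer`, `FlushTimer`, and `handle_one_request` are hypothetical names, plain `Vec<Duration>` fields stand in for the real histograms, and a `sleep` stands in for a client that is slow to read.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for two histograms; not the pageserver's metrics types.
#[derive(Default, Debug)]
struct LatencyMetrics {
    execution: Vec<Duration>,
    flush: Vec<Duration>,
}

/// Timer for the execution phase of a single request.
struct ExecutionTimer {
    started_at: Instant,
}

/// Timer for the flush phase; only obtainable by ending the execution phase.
struct FlushTimer {
    started_at: Instant,
}

impl ExecutionTimer {
    fn start() -> Self {
        Self { started_at: Instant::now() }
    }

    /// Call once the response has been buffered in the writer.
    /// Consumes `self`, so the execution latency cannot be observed twice.
    fn response_buffered(self, metrics: &mut LatencyMetrics) -> FlushTimer {
        let now = Instant::now();
        metrics.execution.push(now.duration_since(self.started_at));
        FlushTimer { started_at: now }
    }
}

impl FlushTimer {
    /// Call once the flush has completed; records into the separate flush metric.
    fn flush_done(self, metrics: &mut LatencyMetrics) {
        metrics.flush.push(self.started_at.elapsed());
    }
}

fn handle_one_request(metrics: &mut LatencyMetrics, slow_client_delay: Duration) {
    let timer = ExecutionTimer::start();
    // ... execute the request and buffer the response here ...
    let flush_timer = timer.response_buffered(metrics);
    // Simulated flush: a slow client delays it, but only the flush metric sees that.
    std::thread::sleep(slow_client_delay);
    flush_timer.flush_done(metrics);
}

fn main() {
    let mut metrics = LatencyMetrics::default();
    handle_one_request(&mut metrics, Duration::from_millis(50));
    println!("{metrics:?}");
}
```

Because each timer is consumed by the call that records it, a given phase can be recorded at most once per request, and a stalled client only inflates the flush measurement.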


github-actions · 2024-12-06T18:59:04Z

7066 tests run: 6747 passed, 0 failed, 319 skipped (full report)

Flaky tests (5)

Postgres 16

test_scrubber_physical_gc_ancestors[2]: release-arm64

Postgres 15

test_prefetch[4]: release-x86-64
test_pull_timeline[True]: release-x86-64

Postgres 14

test_prefetch[None]: release-x86-64
test_pull_timeline[True]: release-x86-64

Code coverage* (full report)

functions: 31.4% (8332 of 26527 functions)
lines: 47.7% (65565 of 137349 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results: 6885e67 at 2024-12-07T09:08:45.651Z


…metrics (#10075)

## Problem

With pipelining enabled, the time a request spends in the batcher stage counts towards the smgr op latency. If pipelining is disabled, that time is not accounted for. In practice, this results in a jump in smgr getpage latencies in various dashboards and degrades the internal SLO.

## Solution

In a similar vein to #10042 and with a similar rationale, this PR stops counting the time spent in batcher stage towards smgr op latency. The smgr op latency metric is reduced to the actual execution time. Time spent in batcher stage is tracked in a separate histogram. I expect to remove that histogram after batching rollout is complete, but it will be helpful in the meantime to reason about the rollout.
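The accounting described in that commit message can be sketched as a breakdown over per-request timestamps. This is an illustration under assumptions, not the code from #10075: `RequestTimestamps`, `LatencyBreakdown`, and their fields are hypothetical names, and the timestamps in `main` are made up for the example.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-request timestamps; not the pageserver's actual types.
struct RequestTimestamps {
    /// When the request was received and entered the batcher stage.
    received_at: Instant,
    /// When the request left the batcher and execution began.
    execution_started_at: Instant,
    /// When execution finished.
    execution_finished_at: Instant,
}

/// The two values that would be observed into two separate histograms.
struct LatencyBreakdown {
    batch_wait: Duration,
    execution: Duration,
}

impl RequestTimestamps {
    fn breakdown(&self) -> LatencyBreakdown {
        LatencyBreakdown {
            // Time in the batcher stage: tracked separately from op latency.
            batch_wait: self.execution_started_at.duration_since(self.received_at),
            // Op latency: execution time only, excluding the batcher wait.
            execution: self
                .execution_finished_at
                .duration_since(self.execution_started_at),
        }
    }
}

fn main() {
    // Illustrative values: 4 ms waiting in the batcher, then 1 ms executing.
    let received_at = Instant::now();
    let ts = RequestTimestamps {
        received_at,
        execution_started_at: received_at + Duration::from_millis(4),
        execution_finished_at: received_at + Duration::from_millis(5),
    };
    let b = ts.breakdown();
    println!("batch wait: {:?}, execution: {:?}", b.batch_wait, b.execution);
}
```

With this split, the op latency reads the same whether or not a request waited in the batcher, which is the property the commit message asks for.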

problame added 3 commits December 6, 2024 15:51

page_service: don't count flush time into smgr latency metrics

14bab2c

WIP track flush time in new metric

b238ad2

fixups

68effc2

problame requested a review from a team as a code owner December 6, 2024 17:28

problame requested review from yliang412 and VladLazar and removed request for yliang412 December 6, 2024 17:28

problame commented Dec 6, 2024

pageserver/src/metrics.rs (review thread, resolved)

fix typestate so we don't call smgr_op_end twice

4e4b53b

problame changed the title ~~page_service: don't count flush towards smgr metrics~~ page_service: don't count time spent flushing towards smgr latency metrics Dec 6, 2024

fix tests

cc6d5d7

VladLazar approved these changes Dec 6, 2024

pageserver/src/metrics.rs (review thread, outdated, resolved)

pageserver/src/metrics.rs (review thread, outdated, resolved)

pageserver/src/metrics.rs (review thread, resolved)

problame added 2 commits December 6, 2024 20:47

less fuss around typestate, more robustness; #10042 (comment)

9f5dae5

clippy

75a35f1

problame enabled auto-merge December 6, 2024 20:10

Merge branch 'main' into problame/smgr-metrics-dont-count-flush-time

47f0621

problame disabled auto-merge December 7, 2024 07:44

problame added 2 commits December 7, 2024 08:46

don't use timed() after all, it logs on every completed request -,-

0e9dd23

clippy

6885e67

problame enabled auto-merge December 7, 2024 07:47

problame added this pull request to the merge queue Dec 7, 2024

Merged via the queue into main with commit 4d7111f Dec 7, 2024
82 checks passed

problame deleted the problame/smgr-metrics-dont-count-flush-time branch December 7, 2024 08:59

problame mentioned this pull request Dec 10, 2024

page_service: don't count time spent in Batcher towards smgr latency metrics #10075

Merged
