Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attempt to perform query just after pageserver restart fails #5620

Closed
alexanderlaw opened this issue Oct 22, 2023 · 8 comments
Closed

Attempt to perform query just after pageserver restart fails #5620

alexanderlaw opened this issue Oct 22, 2023 · 8 comments
Labels
t/bug Issue Type: Bug

Comments

@alexanderlaw
Copy link

Steps to reproduce

cargo neon init --force
cargo neon start
cargo neon tenant create --pg-version 16 --set-default
cargo neon endpoint create --pg-version 16 main
cargo neon endpoint start main

psql postgres -c "SELECT generate_series(1, 10000) g INTO TABLE t1" # successful

cargo neon pageserver stop
cargo neon pageserver start
curl "http://127.0.0.1:9898/v1/tenant" &
time psql postgres -c "SELECT generate_series(1, 10000) g INTO TABLE t2"

Result

[{"id":"96733b9e86549fa222b96f0301560021","state":{"slug":"Loading"},"current_physical_size":null,"attachment_status":{"slug":"attached"}}]
...waiting...
^C
^CCancel request sent
FATAL:  terminating connection due to administrator command
LINE 1: SELECT generate_series(1, 10000) g INTO TABLE t2
               ^
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
connection to server was lost

real    6m45.913s
user    0m0.012s
sys     0m0.003s

Logs, links

ls -lh .neon/pageserver_1/pageserver.log 
-rw-rw-r-- 1 law law 30G Oct 22 11:02 .neon/pageserver_1/pageserver.log

.neon/pageserver_1/pageserver.log contains:

2023-10-22T07:56:05.882681Z ERROR load{tenant_id=96733b9e86549fa222b96f0301560021}: load failed, setting tenant state to Broken: Failed to load local timeline: 64a290f2266c2250d04468f2c31cd9c1

Caused by:
    Timeline has no ancestor and no layer files

Stack backtrace:
   0: pageserver::tenant::Tenant::timeline_init_and_sync::{{closure}}
             at /src/neon/pageserver/src/tenant.rs:497:9
   1: pageserver::tenant::Tenant::load_local_timeline::{{closure}}::{{closure}}
             at /src/neon/pageserver/src/tenant.rs:1560:10
   2: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
...
2023-10-22T07:56:06.518980Z  INFO layer flush task{tenant_id=96733b9e86549fa222b96f0301560021 timeline_id=64a290f2266c2250d04468f2c31cd9c1}: started flush loop
2023-10-22T07:56:06.518981Z  INFO Remote part of initial load completed (0.800s since start) elapsed_ms=800
2023-10-22T07:56:06.519304Z  INFO Initial load completed (0.801s since start) elapsed_ms=800
2023-10-22T07:56:06.519007Z ERROR page_service_conn_main{peer_addr=127.0.0.1:60080}: query handler for 'pagestream 96733b9e86549fa222b96f0301560021 64a290f2266c2250d04468f2c31cd9c1' failed: Tenant 96733b9e86549fa222b96f0301560021 will not become active. Current state: Broken due to: Failed to load local timeline: 64a290f2266c2250d04468f2c31cd9c1. Backtrace:
   0: pageserver_api::models::TenantState::broken_from_reason
             at /src/neon/libs/pageserver_api/src/models.rs:133:51
   1: pageserver::tenant::Tenant::spawn_load::{{closure}}::{{closure}}::{{closure}}
             at /src/neon/pageserver/src/tenant.rs:937:34
$ grep 'Broken due to: Failed to load local timeline' .neon/pageserver_1/pageserver.log | wc -l
1148287

===

ls -lh .neon/endpoints/main/compute.log
-rw-rw-r-- 1 law law 16G Oct 22 11:02 .neon/endpoints/main/compute.log

.neon/endpoints/main/compute.log contains:


2023-10-22 07:56:04.798 GMT [889901] LOG:  [NEON_SMGR] libpagestore: connected to 'postgresql://no_user@127.0.0.1:64000'
2023-10-22 07:56:05.014 GMT [889895] LOG:  standby "walproposer" is now a synchronous standby with priority 1
2023-10-22 07:56:05.305 GMT [889901] LOG:  [NEON_SMGR] pageserver_receive disconnect because call_PQgetCopyData returns -1:  at character 116
2023-10-22 07:56:05.305 GMT [889901] LOG:  [NEON_SMGR] dropping connection to page server due to error at character 116
2023-10-22 07:56:05.305 GMT [889901] LOG:  [NEON_SMGR] could not establish connection to pageserver at character 116
2023-10-22 07:56:05.305 GMT [889901] DETAIL:  connection to server at "127.0.0.1", port 64000 failed: Connection refused
                Is the server running on that host and accepting TCP/IP connections?
2023-10-22 07:56:05.305 GMT [889901] STATEMENT:  SELECT state, to_char(state_change, 'YYYY-MM-DD"T"HH24:MI:SS.US"Z"') AS state_change
                                 FROM pg_stat_activity
                                 WHERE backend_type = 'client backend'
                                    AND pid != pg_backend_pid()
                                    AND usename != 'cloud_admin';
2023-10-22 07:56:06.536 GMT [889901] LOG:  [NEON_SMGR] libpagestore: connected to 'postgresql://no_user@127.0.0.1:64000' at character 116
2023-10-22 07:56:06.536 GMT [889901] LOG:  [NEON_SMGR] dropping connection to page server due to error at character 116
2023-10-22 07:56:06.536 GMT [889901] LOG:  [NEON_SMGR] pageserver_send disconnect because failed to send page request (try to reconnect): ERROR:  Tenant 96733b9e86549fa222b96f0301560021 will not become active. Current state: Broken due 
to: Failed to load local timeline: 64a290f2266c2250d04468f2c31cd9c1. Backtrace:
           0: pageserver_api::models::TenantState::broken_from_reason
                     at /src/neon/libs/pageserver_api/src/models.rs:133:51
           1: pageserver::tenant::Tenant::spawn_load::{{closure}}::{{closure}}::{{closure}}
                     at /src/neon/pageserver/src/tenant.rs:937:34
           2: tokio::sync::watch::Sender<T>::send_modify::{{closure}}
                     at /src/rust/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/sync/watch.rs:836:13
           3: tokio::sync::watch::Sender<T>::send_if_modified::{{closure}}
                     at /src/rust/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/sync/watch.rs:907:73
...
2023-10-22 07:56:06.537 GMT [889901] LOG:  [NEON_SMGR] libpagestore: connected to 'postgresql://no_user@127.0.0.1:64000' at character 116
2023-10-22 07:56:06.537 GMT [889901] LOG:  [NEON_SMGR] dropping connection to page server due to error at character 116
2023-10-22 07:56:06.537 GMT [889901] LOG:  [NEON_SMGR] pageserver_send disconnect because failed to send page request (try to reconnect): ERROR:  Tenant 96733b9e86549fa222b96f0301560021 will not become active. Current state: Broken due to: Failed to load local timeline: 64a290f2266c2250d04468f2c31cd9c1. Backtrace:
           0: pageserver_api::models::TenantState::broken_from_reason
                     at /src/neon/libs/pageserver_api/src/models.rs:133:51
           1: pageserver::tenant::Tenant::spawn_load::{{closure}}::{{closure}}::{{closure}}
                     at /src/neon/pageserver/src/tenant.rs:937:34
           2: tokio::sync::watch::Sender<T>::send_modify::{{closure}}
                     at /src/rust/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/sync/watch.rs:836:13
$ grep 'Broken due to: Failed to load local timeline' .neon/endpoints/main/compute.log | wc -l
1148285
@alexanderlaw alexanderlaw added the t/bug Issue Type: Bug label Oct 22, 2023
@jcsp
Copy link
Collaborator

jcsp commented Oct 23, 2023

This may be a side effect of #5198 in development environments, where neon_local by default doesn't provide any remote storage for the pageserver.

Maybe the reconcile function needs a special case for when there is no remote storage configured. Or, we can make neon_local use a local_fs remote storage by default.

@koivunej
Copy link
Member

koivunej commented Oct 23, 2023

Hmm, I thought it would had supported this still, but agreed, #5198 is to blame.

Or, we can make neon_local use a local_fs remote storage by default.

It's always difficult to remember for me how to configure it, so yes.

@alexanderlaw
Copy link
Author

By the way, when pageserver (re)started rather slowly, I observe a failure of the test test_threshold_based_eviction:

FAILED test_runner/regress/test_threshold_based_eviction.py::test_threshold_based_eviction[debug-pg16] - fixtures.pageserver.http.PageserverApiException: NotFound: tenant Tenant 23ce38f1d7305b2a636cef82f1d56a43 is not active
self = <fixtures.pageserver.http.PageserverHttpClient object at 0x7f2e941f8760>, res = <Response [404]>

    def verbose_error(self, res: requests.Response):
        try:
            res.raise_for_status()
        except requests.RequestException as e:
            try:
                msg = res.json()["msg"]
            except:  # noqa: E722
                msg = ""
>           raise PageserverApiException(msg, res.status_code) from e
E           fixtures.pageserver.http.PageserverApiException: NotFound: tenant Tenant 23ce38f1d7305b2a636cef82f1d56a43 is not active

test_runner/fixtures/pageserver/http.py:167: PageserverApiException

Or with a standalone script:

cargo neon init --force
cargo neon start
cargo neon tenant create --pg-version 16 --set-default
t=$(cargo neon tenant list| cut -f 1 -d " ")
cargo neon endpoint create --pg-version 16 main
cargo neon endpoint start main
cargo neon pageserver stop
cargo neon pageserver start
curl -X PUT  -H "Content-Type: application/json" -d "{\"tenant_id\": \"$t\", \"gc_period\": \"123s\"}" "http://127.0.0.1:9898/v1/tenant/config"
echo ""

I get:

...
Starting pageserver node 1 at '127.0.0.1:64000' in ".neon/pageserver_1".
pageserver started, pid: 2194813
{"msg":"NotFound: tenant Tenant 51ebda24580b47527527400d53d09f05 is not active"}

@koivunej
Copy link
Member

koivunej commented Oct 23, 2023

By the way, when pageserver (re)started rather slowly, I observe a failure of the test test_threshold_based_eviction:

[...]

Or with a standalone script:

[...]

These two invocations are not equivalent anymore because the test configures remote storage but the standalone version does not (yet).

(python test failure and stacktraces)

With the python test failures, showing the full stacktrace is useful. Sadly with plain ./scripts/pytest you get the verbose one which does source listings starting in a dependency, one needs to use --tb=short to get a human readable one.

The error here points to our http client we use in tests, which is not useful. Full stacktrace would contain the test and line in test.

fixtures.pageserver.http.PageserverApiException: NotFound: tenant Tenant 23ce38f1d7305b2a636cef82f1d56a43 is not active

The "tenant is not yet ready" is a known issue. There is an open PR for waiting tenants to be ready in the startup by default to get rid of all this flakyness (#4864).

@alexanderlaw
Copy link
Author

@koivunej , thanks for the information!

@koivunej
Copy link
Member

koivunej commented Oct 24, 2023

I promised to understand this before tomorrow's release, and I did:

(Some(x), None) => Err(DismissedLayer::LocalOnly(x)),

So it is as John assumed.

@koivunej
Copy link
Member

#5760 will indirectly fix this by always setting up remote storage. Leaving this open while remote_storage = None (default for cargo neon init at the time of finding this issue) is possible.

@hlinnaka
Copy link
Contributor

Remote storage is required now (see #7722), so closing this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
t/bug Issue Type: Bug
Projects
None yet
Development

No branches or pull requests

4 participants