renterd worker-only node fails to start-up #1302

Closed
artur9010 opened this issue Jun 13, 2024 · 11 comments · Fixed by #1324
Labels: bug (Something isn't working)

Comments

@artur9010

Current Behavior

It looks like a renterd worker-only node fails to start up: it tries to connect to itself and gets a 404.

Using RENTERD_BUS_REMOTE_ADDR environment variable
Using RENTERD_BUS_API_PASSWORD environment variable
Using RENTERD_WORKER_API_PASSWORD environment variable
Using RENTERD_WORKER_ENABLED environment variable
Using RENTERD_WORKER_ID environment variable
Using RENTERD_AUTOPILOT_ENABLED environment variable
2024-06-13T15:08:57Z	INFO	renterd	{"version": "c7b80d4", "network": "Zen Testnet", "commit": "c7b80d4", "buildDate": "2024-06-13T13:05:25Z"}
2024-06-13T15:08:57Z	INFO	connecting to remote bus at http://renterd-bus:9980/api/bus
2024-06-13T15:08:57Z	FATAL	failed to setup worker: failed to register webhook 'http://[::]:9980/api/worker/events.consensus.update', err: failed to add Webhook: Webhook returned unexpected status 404:

envs passed to workers:

~ $ env | grep RENTERD
RENTERD_SEED=*removed*
RENTERD_BUS_API_PASSWORD=*removed*
RENTERD_WORKER_API_PASSWORD=*removed*
RENTERD_BUS_REMOTE_ADDR=http://renterd-bus:9980/api/bus
RENTERD_CONFIG_FILE=/data/renterd.yml
RENTERD_API_PASSWORD=*removed*
RENTERD_AUTOPILOT_ENABLED=false
RENTERD_WORKER_ENABLED=true
RENTERD_WORKER_ID=renterd-worker-1

Expected Behavior

Start without failing

Steps to Reproduce

No response

Version

c7b80d4

What operating system did the problem occur on (e.g. Ubuntu 22.04, macOS 12.0, Windows 11)?

Ubuntu 22.04, kubernetes (https://github.com/artur9010/charts/tree/master/renterd)

Autopilot Config

probably not needed

Bus Config

probably not needed

Contract Set Contracts

probably not needed

Anything else?

No response

@artur9010 artur9010 added the bug Something isn't working label Jun 13, 2024
@n8maninger
Member

n8maninger commented Jun 13, 2024

What commit is your bus on?

@artur9010
Author

artur9010 commented Jun 13, 2024

2024-06-13T15:06:43Z INFO renterd {"version": "c7b80d4", "network": "Zen Testnet", "commit": "c7b80d4", "buildDate": "2024-06-13T13:05:25Z"}
and it (bus) started without any issue

it also did a migration before this error occurred

2024-06-13T15:06:44Z	INFO	sql	migration '00010_webhook_headers' complete

@artur9010
Author

It looks like enabling the worker (RENTERD_WORKER_ENABLED=true) on the bus container solved the issue; the worker containers are now starting without any problem.

Also, the error message suggests that the error is related to the worker container, not the bus one:
failed to setup worker: failed to register webhook 'http://[::]:9980/api/worker/events.consensus.update'

@peterjan
Member

Hm ok so definitely related to the worker cache PR. I tested a multi node cluster setup though before moving it into review.

This is the bus side:

RENTERD_WORKER_ENABLED=false RENTERD_WORKER_REMOTE_ADDRS=http://127.0.0.1:9970 RENTERD_WORKER_API_PASSWORD=test ./renterd --openui=false

This is the worker side:

RENTERD_S3_ENABLED=false RENTERD_AUTOPILOT_ENABLED=false RENTERD_BUS_REMOTE_ADDR=http://127.0.0.1:9880/api/bus RENTERD_BUS_API_PASSWORD=test ./renterd --http=:9970 --openui=false

This works for me on latest dev (c7b80d419db742c82a80fba7f50beec3a896eb0a). It looks like your bus is not finding the worker, how did you configure the bus? Your last message confuses me by the way, I don't see how changing something in your bus container would all of a sudden make your worker containers work. In any case in a multi node setup you likely wouldn't want to enable the worker on your bus node so perhaps we should not focus on that.

@peterjan peterjan self-assigned this Jun 14, 2024
@artur9010
Author

artur9010 commented Jun 14, 2024

For me it works only if I have worker enabled on bus

https://www.youtube.com/watch?v=PbcckHtLgac

how did you configure the bus

~ $ hostname
renterd-bus-0
~ $ env | grep REN
RENTERD_DB_NAME=renterd
RENTERD_SEED=*removed*
RENTERD_DB_METRICS_NAME=renterd_metrics
RENTERD_BUS_API_PASSWORD=*removed*
RENTERD_WORKER_API_PASSWORD=*removed*
RENTERD_DB_PASSWORD=*removed*
RENTERD_DB_URI=mysql:3306
RENTERD_CONFIG_FILE=/data/renterd.yml
RENTERD_DB_USER=renterd
RENTERD_API_PASSWORD=*removed*
RENTERD_AUTOPILOT_ENABLED=false
RENTERD_WORKER_ENABLED=true <--- if I set this to false, then it stops working
~ $ cat /data/renterd.yml 
# Managed by Helm - configmap/renterd-bus/renterd.yml
bus:
  gatewayAddr: "0.0.0.0:9981"
  
s3:
  enabled: true

and workers

~ $ hostname
renterd-worker-0
~ $ env | grep REN
RENTERD_SEED=removed
RENTERD_BUS_API_PASSWORD=removed
RENTERD_WORKER_API_PASSWORD=removed
RENTERD_BUS_REMOTE_ADDR=http://renterd-bus:9980/api/bus
RENTERD_CONFIG_FILE=/data/renterd.yml
RENTERD_API_PASSWORD=removed
RENTERD_AUTOPILOT_ENABLED=false
RENTERD_WORKER_ENABLED=true
RENTERD_WORKER_ID=renterd-worker-0
~ $ cat /data/renterd.yml 
# Managed by Helm - configmap/renterd-worker/renterd.yml

s3:
  enabled: true

edit:
entrypoint

exec renterd --http=':9980' --s3.address=':8080' --log.file.enabled=false

@peterjan
Member

Could you add RENTERD_WORKER_REMOTE_ADDRS to your bus config and set RENTERD_WORKER_ENABLED to false? That should work. If you configure your settings through a YAML file, you have to configure it using worker.remotes; the env variables are deprecated and will be removed soon.

I think we should add some hardening for when users configure RENTERD_WORKER_API_PASSWORD but don't define RENTERD_WORKER_REMOTE_ADDRS, or maybe for when they have the bus enabled and these settings are defined. It's definitely not straightforward to configure a clustered setup at the moment.

@artur9010
Author

bus

RENTERD_WORKER_ENABLED=false
RENTERD_WORKER_REMOTE_ADDRS=http://renterd-worker-0.renterd-worker:9980/api/worker;http://renterd-worker-1.renterd-worker:9980/api/worker;http://renterd-worker-2.renterd-worker:9980/api/worker;http://renterd-worker-3.renterd-worker:9980/api/worker;http://renterd-worker-4.renterd-worker:9980/api/worker;http://renterd-worker-5.renterd-worker:9980/api/worker

bus logs

Using RENTERD_BUS_API_PASSWORD environment variable
Using RENTERD_DB_URI environment variable
Using RENTERD_DB_USER environment variable
Using RENTERD_DB_PASSWORD environment variable
Using RENTERD_DB_NAME environment variable
Using RENTERD_DB_METRICS_NAME environment variable
Using RENTERD_WORKER_REMOTE_ADDRS environment variable
Using RENTERD_WORKER_API_PASSWORD environment variable
Using RENTERD_WORKER_ENABLED environment variable
Using RENTERD_AUTOPILOT_ENABLED environment variable
2024-06-18T22:40:21Z	INFO	renterd	{"version": "c7b80d4", "network": "Zen Testnet", "commit": "c7b80d4", "buildDate": "2024-06-13T13:05:25Z"}
2024-06-18T22:40:21Z	INFO	sql	Using MySQL version 8.4.0
2024-06-18T22:40:21Z	WARN	sql	slow exec	{"query": "\n\tDELETE FROM slabs\n\tWHERE id IN (\n    SELECT id\n    FROM (\n        SELECT slabs.id\n        FROM slabs\n        WHERE NOT EXISTS (\n            SELECT 1 FROM slices WHERE slices.db_slab_id = slabs.id\n        )\n        AND slabs.db_buffered_slab_id IS NULL\n        LIMIT ?\n    ) AS limited\n\t)", "elapsed": "332.124718ms", "stack": "go.sia.tech/renterd/internal/sql.(*loggedTxn).Exec\n\t/renterd/internal/sql/log.go:100\ngo.sia.tech/renterd/stores/sql/mysql.(*MainDatabaseTx).PruneSlabs\n\t/renterd/stores/sql/mysql/main.go:452\ngo.sia.tech/renterd/stores.(*SQLStore).initSlabPruning.func2\n\t/renterd/stores/sql.go:320\ngo.sia.tech/renterd/stores/sql/mysql.(*MainDatabase).Transaction.func1\n\t/renterd/stores/sql/mysql/main.go:73\ngo.sia.tech/renterd/internal/sql.(*DB).transaction\n\t/renterd/internal/sql/sql.go:217\ngo.sia.tech/renterd/internal/sql.(*DB).Transaction\n\t/renterd/internal/sql/sql.go:149\ngo.sia.tech/renterd/stores/sql/mysql.(*MainDatabase).Transaction\n\t/renterd/stores/sql/mysql/main.go:72\ngo.sia.tech/renterd/stores.(*SQLStore).initSlabPruning\n\t/renterd/stores/sql.go:319\ngo.sia.tech/renterd/stores.NewSQLStore\n\t/renterd/stores/sql.go:293\ngo.sia.tech/renterd/internal/node.NewBus\n\t/renterd/internal/node/node.go:141\nmain.main\n\t/renterd/cmd/renterd/main.go:500\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:267"}
2024-06-18T22:40:21Z	INFO	connecting to remote worker at http://renterd-worker-0.renterd-worker:9980/api/worker
2024-06-18T22:40:21Z	INFO	connecting to remote worker at http://renterd-worker-1.renterd-worker:9980/api/worker
2024-06-18T22:40:21Z	INFO	connecting to remote worker at http://renterd-worker-2.renterd-worker:9980/api/worker
2024-06-18T22:40:21Z	INFO	connecting to remote worker at http://renterd-worker-3.renterd-worker:9980/api/worker
2024-06-18T22:40:21Z	INFO	connecting to remote worker at http://renterd-worker-4.renterd-worker:9980/api/worker
2024-06-18T22:40:21Z	INFO	connecting to remote worker at http://renterd-worker-5.renterd-worker:9980/api/worker
2024-06-18T22:40:21Z	INFO	api: Listening on [::]:9980
2024-06-18T22:40:21Z	ERROR	webhooks.webhooks	failed to send Webhook event setting.update to http://[::]:9980/api/worker/events: Webhook returned unexpected status 404:
2024-06-18T22:40:21Z	INFO	bus: Listening on localhost:9981

worker log

Using RENTERD_BUS_REMOTE_ADDR environment variable
Using RENTERD_BUS_API_PASSWORD environment variable
Using RENTERD_WORKER_API_PASSWORD environment variable
Using RENTERD_WORKER_ENABLED environment variable
Using RENTERD_WORKER_ID environment variable
Using RENTERD_AUTOPILOT_ENABLED environment variable
2024-06-18T22:41:56Z	INFO	renterd	{"version": "c7b80d4", "network": "Zen Testnet", "commit": "c7b80d4", "buildDate": "2024-06-13T13:05:25Z"}
2024-06-18T22:41:56Z	INFO	connecting to remote bus at http://renterd-bus:9980/api/bus
2024-06-18T22:41:56Z	FATAL	failed to setup worker: failed to register webhook 'http://[::]:9980/api/worker/events.consensus.update', err: failed to add Webhook: Webhook returned unexpected status 404:

it isn't starting anymore

@peterjan
Member

peterjan commented Jun 20, 2024

Ok, well, it's because of your worker config. It's the worker's HTTP config that defaults to localhost:9980. The worker is essentially asking the bus to contact it on that address, while it should be http://renterd-worker-0.renterd-worker:9980. I guess I'll have to set it up myself so I can guide you through the setup. I've never really run it in a distributed setup using containers; if I want to test it, I just do it locally using different ports. We might need to add an extra config for this now that I think about it more. When a worker has to connect to a remote bus, it probably needs to know where it lives on the network.
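
To make the failure mode above concrete, here is a minimal Python sketch. It is not renterd's actual code: it only assumes that the worker derives the webhook callback URL from its HTTP listen address. The names `webhook_url`, `is_reachable_from_remote` and the `external_addr` parameter are hypothetical and exist only for illustration.

```python
from urllib.parse import urlsplit

# Hosts that mean "all interfaces" or "this machine". A bus running on
# another node cannot call back to any of them.
UNROUTABLE_HOSTS = {"", "::", "0.0.0.0", "localhost", "127.0.0.1", "::1"}


def webhook_url(listen_addr, external_addr=None, path="/api/worker/events"):
    """Build the callback URL a worker would register with the bus.

    listen_addr   -- address the HTTP server binds to, e.g. ":9980"
    external_addr -- hypothetical setting: base URL other nodes can reach
    """
    if external_addr:
        return external_addr.rstrip("/") + path
    # Split "host:port" at the last colon so "[::]:9980" keeps its brackets.
    host, _, port = listen_addr.rpartition(":")
    if host == "":
        host = "[::]"  # an empty host means "listen on all interfaces"
    return f"http://{host}:{port}{path}"


def is_reachable_from_remote(url):
    """Return False when the URL points at a wildcard or loopback host."""
    return urlsplit(url).hostname not in UNROUTABLE_HOSTS
```

With `--http=':9980'` and no external address, the sketch produces `http://[::]:9980/api/worker/events`, the same shape as the URL in the logs, which a remote bus cannot reach. Passing `http://renterd-worker-0.renterd-worker:9980` as the external address yields a URL the bus can call.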

@peterjan
Member

@artur9010 I opened PR #1324 to allow you to specify the worker's API endpoint. Thank you for reporting it.

@artur9010
Author

Wouldn't adding a requirement to specify worker addresses make it harder to add additional workers without downtime?

RENTERD_WORKER_REMOTE_ADDRS=http://worker:9980/api/worker

In my case, where I run renterd inside a k8s cluster, I generate and specify the workers' addresses inside the pod template, so any change to it (e.g. scaling up or down) requires deleting and recreating a pod. Workers can't work without the bus, but the bus needs to know the worker addresses before it starts. I think it is OK to specify worker addresses for the autopilot, since it can run outside of the bus and, in contrast to the bus, can be restarted without causing any issue. The bus, in my opinion, shouldn't need that: workers should register with the bus somehow (maybe an option to specify the node port and IP in the worker, plus some heartbeat from the workers).

I guess that the bus will proxy all /api/worker requests to the addresses specified in RENTERD_WORKER_REMOTE_ADDRS?
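
The registration-plus-heartbeat idea suggested above could look roughly like the following Python sketch. It is purely illustrative and does not describe anything renterd implements: `WorkerRegistry`, its methods and the 30-second TTL are all hypothetical choices.

```python
import time


class WorkerRegistry:
    """In-memory registry of workers that announce themselves via heartbeats."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self._ttl = ttl_seconds  # how long a worker stays listed without a heartbeat
        self._clock = clock      # injectable so the behaviour can be tested
        self._last_seen = {}     # worker address -> time of last heartbeat

    def heartbeat(self, worker_addr):
        """Record that the worker at worker_addr is alive right now."""
        self._last_seen[worker_addr] = self._clock()

    def active_workers(self):
        """Return addresses heard from within the TTL, dropping stale ones."""
        now = self._clock()
        self._last_seen = {
            addr: seen
            for addr, seen in self._last_seen.items()
            if now - seen <= self._ttl
        }
        return sorted(self._last_seen)
```

Under this design, scaling workers up or down would not require a static address list: a new worker appears after its first heartbeat, and a removed one disappears once its TTL expires.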

@peterjan
Member

peterjan commented Jun 21, 2024

Hm, well, it's definitely a use case I'm not familiar with; I've never set up renterd as a k8s cluster. I'm not sure I'm following though, because the bus is not aware of the worker pool. The worker needs a bus, and it's the autopilot that needs to know about all workers. Upscaling workers would require restarting the autopilot, but not the bus. The bus does not process any /api/worker requests: it doesn't expose the worker API, it only exposes the bus API. #1324 is still in review and will probably change, but I don't think we can get around adding a dedicated field to specify the worker's API address in cases where we're dealing with a remote bus.

Edit: re-reading your comment, I think the confusion comes from the fact that it is unclear which service depends on which config. I'll look into documenting that. For instance, the env variable you mention (RENTERD_WORKER_REMOTE_ADDRS) is not needed to start a worker or bus node; only the autopilot needs it.
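
As a rough summary of the dependencies described in this comment, the Python sketch below records which setting each role needs in order to find other nodes. It covers only what is stated in this thread and is not a complete or authoritative list of renterd settings; `REQUIRED_REMOTE_SETTINGS` and `missing_settings` are hypothetical names.

```python
# Settings each role needs to locate other nodes, as described in this thread
# only. Real deployments need more configuration (passwords, seed, database).
REQUIRED_REMOTE_SETTINGS = {
    "bus": [],                                    # the bus is not aware of the worker pool
    "worker": ["RENTERD_BUS_REMOTE_ADDR"],        # a worker needs a bus
    "autopilot": ["RENTERD_WORKER_REMOTE_ADDRS"], # the autopilot needs all workers
}


def missing_settings(role, env):
    """Return the settings from the table above that are unset or empty."""
    return [name for name in REQUIRED_REMOTE_SETTINGS[role] if not env.get(name)]
```

For example, `missing_settings("worker", {})` reports `RENTERD_BUS_REMOTE_ADDR`, while the same call for `"bus"` reports nothing.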

Projects
Status: Done
3 participants