
Crashlooping on start #479

Closed
rocktavious opened this issue Jun 12, 2024 · 16 comments
@rocktavious

  • Which Faktory package and version? 1.8.0 Enterprise
  • Which Faktory worker package and version? N/A (just trying to start up the server)
  • Are you using an old version? Yes, by 1 release
  • Have you checked the changelogs to see if your issue has been fixed in a later version? Yes

When starting up the Faktory server we get constant crashlooping in Kubernetes, and the logs don't indicate why even when bumped to debug level.

This is for the Enterprise-licensed Faktory we are trying to stand up.

Here are the logs it prints out.

opslevel-faktory-0 server Faktory Enterprise 1.8.0 linux/amd64
opslevel-faktory-0 server © 2024 Contributed Systems LLC.
opslevel-faktory-0 server D 2024-06-12T20:11:18.062Z Options: {:7419 :7420 staging /etc/faktory debug /var/lib/faktory/db}
opslevel-faktory-0 server I 2024-06-12T20:11:18.062Z Licensed to XXXXXXX_REDACTED_XXXXXXXX, max 200 connections
opslevel-faktory-0 server D 2024-06-12T20:11:18.062Z Reading configuration in /etc/faktory/conf.d/statsd.toml
opslevel-faktory-0 server D 2024-06-12T20:11:18.062Z Reading configuration in /etc/faktory/conf.d/webui.toml
opslevel-faktory-0 server I 2024-06-12T20:11:18.062Z Initializing redis storage at /var/lib/faktory/db, socket /var/lib/faktory/db/redis.sock
opslevel-faktory-0 server D 2024-06-12T20:11:18.129Z Booting Redis: /usr/bin/redis-server /tmp/redis.conf --unixsocket /var/lib/faktory/db/redis.sock --loglevel notice --dir /var/lib/faktory/db --logfile /var/lib/faktory/db/redis.log
opslevel-faktory-0 server D 2024-06-12T20:11:18.139Z Redis booted in 10.202475ms
opslevel-faktory-0 server D 2024-06-12T20:11:18.140Z Running Redis v7.0.12
opslevel-faktory-0 server D 2024-06-12T20:11:18.140Z Merged configuration
opslevel-faktory-0 server D 2024-06-12T20:11:18.140Z map[statsd:map[location:datadog-agent.datadog.svc.cluster.local:8125 namespace:faktory queueLatency:[app default runner] tags:[env:dev]] webui:map[title:OpsLevel Dev Faktory]]
opslevel-faktory-0 server I 2024-06-12T20:11:18.140Z Limited to 100 connections in the staging environment
opslevel-faktory-0 server I 2024-06-12T20:11:18.140Z Web server now listening at :7420
opslevel-faktory-0 server D 2024-06-12T20:11:18.140Z No cron configuration found
opslevel-faktory-0 server D 2024-06-12T20:11:18.140Z Cron subsystem managing 0 periodic jobs
opslevel-faktory-0 server I 2024-06-12T20:11:18.140Z Sending statsd metrics to datadog-agent.datadog.svc.cluster.local:8125 with namespace faktory
opslevel-faktory-0 server D 2024-06-12T20:11:18.148Z Stopping scheduled tasks
opslevel-faktory-0 server D 2024-06-12T20:11:18.248Z Stopping storage
opslevel-faktory-0 server D 2024-06-12T20:11:18.248Z Shutting down Redis PID 692

Additionally, Kubernetes reports the process as exiting cleanly with exit code 0:

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 12 Jun 2024 14:05:11 -0500
      Finished:     Wed, 12 Jun 2024 14:05:11 -0500
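
(For reference, the state above and the previous run's debug logs come from standard kubectl inspection; the container name server is taken from the log prefix and is an assumption about how the pod is defined.)

kubectl describe pod opslevel-faktory-0
kubectl logs opslevel-faktory-0 -c server --previous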
@mperham
Collaborator

mperham commented Jun 12, 2024

I’ve never seen this, and there’s nothing in the logs that gives me any more info. Can you start Faktory manually, outside of k8s, on the same machine?

@rocktavious
Author

rocktavious commented Jun 12, 2024

Because this is running on a Kubernetes cluster and we've locked down direct access to the nodes, this is not something I can easily test. Additionally, the pod has been deleted and recreated a number of times, which schedules it onto one of the 7 nodes in this cluster, and it hasn't worked on any of them.

The only difference I see in the node pools between the cluster it does work in and this one is the kernel and Amazon Linux version.

Cluster it works in

OS-IMAGE         KERNEL-VERSION                  CONTAINER-RUNTIME
Amazon Linux 2   5.10.173-154.642.amzn2.x86_64   containerd://1.6.19

Cluster it doesn't work in

OS-IMAGE         KERNEL-VERSION                  CONTAINER-RUNTIME
Amazon Linux 2   5.10.215-203.850.amzn2.x86_64   containerd://1.7.11

I'll keep double-checking the configuration between the two, but is it possible the same license file cannot be used for both -e staging and -e production?

I think we ordered 2 licenses, but the e-mail I was sent from my procurement team only had 1 license in it. And I noticed the log line says 200 connections when I expected each to only be 100.

@mperham
Collaborator

mperham commented Jun 12, 2024

You can share a single license between two servers, giving each 100 connections. I believe the Licensing wiki page covers that.

@mperham
Collaborator

mperham commented Jun 12, 2024

And yes, the same license can be used in staging and production.

@rocktavious
Author

@mperham what do you mean by giving each 100 connections? Are you saying that if I use the same license key between two instances, even if the environment is set to staging, I also have to set FAKTORY_MAX_CONNS=100 for both the production and staging servers?

@mperham
Collaborator

mperham commented Jun 13, 2024

https://github.com/contribsys/faktory/wiki/Licensing#staging

"Faktory has the notion of environment -- you start Faktory in development, staging or production and only pay for production servers."

You have a total of 200 connections in production. You can have two production servers sharing the same license, each with FAKTORY_CONN_MAX=100. You can have unlimited staging servers but each staging server is limited to 100 connections automatically when you use -e staging.
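
Roughly, that split looks like this (placeholder key shown; ports and other flags omitted):

# two production servers sharing one 200-connection license, 100 each
FAKTORY_LICENSE=<your-key> FAKTORY_CONN_MAX=100 faktory -e production

# staging servers are free and capped at 100 connections automatically
FAKTORY_LICENSE=<your-key> faktory -e staging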

@rocktavious
Author

@mperham
So then something is broken with the startup for Faktory. Here is the command I'm using:

faktory -b :7419 -w :7420 -l debug -e staging

and I get these logs:

opslevel-faktory-0 server Faktory Enterprise 1.9.0 linux/amd64
opslevel-faktory-0 server © 2024 Contributed Systems LLC.
opslevel-faktory-0 server D 2024-06-17T14:21:43.540Z Options: {:7419 :7420 staging /etc/faktory debug /var/lib/faktory/db}
opslevel-faktory-0 server W 2024-06-17T14:21:43.540Z Invalid licensing, please see the Faktory wiki for proper configuration
opslevel-faktory-0 server E 2024-06-17T14:21:43.540Z : No valid licensing found in FAKTORY_LICENSE or /etc/faktory/license: 

and then the pod shuts down.

I've also tried giving it the FAKTORY_LICENSE which was sent to us, and that's when I get the logs shown in the initial report above.

@mperham
Collaborator

mperham commented Jun 17, 2024

Sounds like you have a good clue. I have to assume there's some annoyance in your licensing configuration preventing startup, like unwanted whitespace or something.

@rocktavious
Author

@mperham I FOUND IT. It totally is the statsd settings.

I switched it back to using the FAKTORY_LICENSE, and when using -e staging it stands up as long as I remove the statsd.toml, because it's pointed at a Datadog agent that doesn't exist. So it seems like the statsd initialization is crashing the Faktory server if it cannot reach the statsd server???

@mperham
Collaborator

mperham commented Jun 17, 2024

That still doesn't make sense to me; Datadog should be more resilient than that. The statsd protocol uses UDP, which is connectionless. The existence of the remote side should be irrelevant.

@rocktavious
Author

@mperham - well, I've confirmed it. Adding back the statsd.toml causes it to shut down.

Here is my config:

[statsd]
location = "datadog-agent.datadog.svc.cluster.local:8125"
namespace = "faktory"
tags = ["env:dev"]
queueLatency = ["app", "default", "runner"]

So it seems there are a few things:

  • To use -e staging you still need to provide a valid FAKTORY_LICENSE (which I don't think is expected).
  • If statsd.toml has a location that isn't present or doesn't resolve, the Faktory application exits with exit code 0 without any logs indicating what the problem is.

I've replicated this on both 1.8.0 and 1.9.0 using the Enterprise image docker.contribsys.com/contribsys/faktory-ent.

@mperham
Collaborator

mperham commented Jun 17, 2024

Yep, I can reproduce this. Fix coming...

@mperham
Collaborator

mperham commented Jun 17, 2024

I've found and fixed the issue with Faktory stopping with no error message; it now logs the failure:

I 2024-06-17T16:15:07.010Z Sending statsd metrics to mike:8150 with namespace faktory
I 2024-06-17T16:15:07.010Z Web server now listening at localhost:7420
E 2024-06-17T16:15:07.012Z Unable to start Faktory: cannot start server subsystem statsd at mike:8150: lookup mike: no such host
D 2024-06-17T16:15:07.012Z Stopping scheduled tasks
D 2024-06-17T16:15:07.113Z Stopping storage
D 2024-06-17T16:15:07.113Z Shutting down Redis PID 5739

@mperham
Collaborator

mperham commented Jun 17, 2024

I suspect your underlying issue is that the Datadog statsd host is not DNS-resolvable. Datadog resolves the IP address when it creates the statsd connection.
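
For illustration only (this is generic Go, not Faktory's exact code): a UDP dial still performs the DNS lookup up front, so a statsd location that doesn't resolve fails at startup even though UDP itself is connectionless.

package main

import (
	"fmt"
	"net"
)

func main() {
	// The address stands in for the statsd "location" from statsd.toml; if the
	// hostname doesn't resolve, Dial returns a "no such host" error immediately,
	// even though no packets are sent to establish a UDP "connection".
	conn, err := net.Dial("udp", "datadog-agent.datadog.svc.cluster.local:8125")
	if err != nil {
		fmt.Println("cannot start statsd subsystem:", err)
		return
	}
	defer conn.Close()
	fmt.Println("statsd sink ready:", conn.RemoteAddr())
}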

mperham closed this as completed Jun 17, 2024
@rocktavious
Author

Thanks @mperham - I assume this will be in 1.10?

Also what about the -e staging still requiring the FAKTORY_LICENSE env var to be filled? Is that expected?

mperham added a commit that referenced this issue Jun 17, 2024
@mperham
Collaborator

mperham commented Jun 17, 2024

It will be in 1.9.1 or 1.10.

A license is always required outside of the development environment.
