
Crashlooping on start #479

Closed
rocktavious opened this issue Jun 12, 2024 · 16 comments
@rocktavious

  • Which Faktory package and version? 1.8.0 Enterprise
  • Which Faktory worker package and version? N/A (just trying to start up the server)
  • Are you using an old version? Yes, by 1 release
  • Have you checked the changelogs to see if your issue has been fixed in a later version? Yes

When starting up the Faktory server we get constant crashlooping in Kubernetes, and the logs don't indicate why even when bumped to debug level.

This is for the Enterprise-licensed Faktory we are trying to stand up.

Here are the logs it prints out.

opslevel-faktory-0 server Faktory Enterprise 1.8.0 linux/amd64
opslevel-faktory-0 server © 2024 Contributed Systems LLC.
opslevel-faktory-0 server D 2024-06-12T20:11:18.062Z Options: {:7419 :7420 staging /etc/faktory debug /var/lib/faktory/db}
opslevel-faktory-0 server I 2024-06-12T20:11:18.062Z Licensed to XXXXXXX_REDACTED_XXXXXXXX, max 200 connections
opslevel-faktory-0 server D 2024-06-12T20:11:18.062Z Reading configuration in /etc/faktory/conf.d/statsd.toml
opslevel-faktory-0 server D 2024-06-12T20:11:18.062Z Reading configuration in /etc/faktory/conf.d/webui.toml
opslevel-faktory-0 server I 2024-06-12T20:11:18.062Z Initializing redis storage at /var/lib/faktory/db, socket /var/lib/faktory/db/redis.sock
opslevel-faktory-0 server D 2024-06-12T20:11:18.129Z Booting Redis: /usr/bin/redis-server /tmp/redis.conf --unixsocket /var/lib/faktory/db/redis.sock --loglevel notice --dir /var/lib/faktory/db --logfile /var/lib/faktory/db/redis.log
opslevel-faktory-0 server D 2024-06-12T20:11:18.139Z Redis booted in 10.202475ms
opslevel-faktory-0 server D 2024-06-12T20:11:18.140Z Running Redis v7.0.12
opslevel-faktory-0 server D 2024-06-12T20:11:18.140Z Merged configuration
opslevel-faktory-0 server D 2024-06-12T20:11:18.140Z map[statsd:map[location:datadog-agent.datadog.svc.cluster.local:8125 namespace:faktory queueLatency:[app default runner] tags:[env:dev]] webui:map[title:OpsLevel Dev Faktory]]
opslevel-faktory-0 server I 2024-06-12T20:11:18.140Z Limited to 100 connections in the staging environment
opslevel-faktory-0 server I 2024-06-12T20:11:18.140Z Web server now listening at :7420
opslevel-faktory-0 server D 2024-06-12T20:11:18.140Z No cron configuration found
opslevel-faktory-0 server D 2024-06-12T20:11:18.140Z Cron subsystem managing 0 periodic jobs
opslevel-faktory-0 server I 2024-06-12T20:11:18.140Z Sending statsd metrics to datadog-agent.datadog.svc.cluster.local:8125 with namespace faktory
opslevel-faktory-0 server D 2024-06-12T20:11:18.148Z Stopping scheduled tasks
opslevel-faktory-0 server D 2024-06-12T20:11:18.248Z Stopping storage
opslevel-faktory-0 server D 2024-06-12T20:11:18.248Z Shutting down Redis PID 692

Additionally, Kubernetes reports the process as exiting cleanly with exit code 0:

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 12 Jun 2024 14:05:11 -0500
      Finished:     Wed, 12 Jun 2024 14:05:11 -0500
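
(For reference, the state above and the previous run's debug logs come from standard kubectl inspection; the container name server is taken from the log prefix and is an assumption about how the pod is defined.)

kubectl describe pod opslevel-faktory-0
kubectl logs opslevel-faktory-0 -c server --previous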
@mperham
Collaborator

mperham commented Jun 12, 2024

I’ve never seen this, and there’s nothing in the logs that gives me any more info. Can you start Faktory manually, outside of k8s, on the same machine?

@rocktavious
Author

rocktavious commented Jun 12, 2024

Because this is running on a Kubernetes cluster and we've locked down direct access to the nodes, this is not something I can easily test. Additionally, the pod has been deleted and recreated a number of times, which schedules it onto one of the 7 nodes in this cluster, and it hasn't worked on any of them.

The only difference I see in the node pools between the cluster it does work in and this one is the kernel and Amazon Linux version.

Cluster it works in

OS-IMAGE         KERNEL-VERSION                  CONTAINER-RUNTIME
Amazon Linux 2   5.10.173-154.642.amzn2.x86_64   containerd://1.6.19

Cluster it doesn't work in

OS-IMAGE         KERNEL-VERSION                  CONTAINER-RUNTIME
Amazon Linux 2   5.10.215-203.850.amzn2.x86_64   containerd://1.7.11

I'll keep double-checking the configuration between the two, but is it possible the same license file cannot be used for both -e staging and -e production?

I think we ordered 2 licenses, but the e-mail I was sent from my procurement team only had 1 license in it. And I noticed the log line says 200 connections when I expected each to only be 100.

@mperham
Collaborator

mperham commented Jun 12, 2024

You can share a single license between two servers, giving each 100 connections. I believe the Licensing wiki page covers that.

@mperham
Collaborator

mperham commented Jun 12, 2024

And yes, the same license can be used in staging and production.

@rocktavious
Author

@mperham what do you mean by giving each 100 connections? Are you saying that if I use the same license key between two instances, even if the environment is set to staging, I also have to set FAKTORY_MAX_CONNS=100 for both the production and staging servers?

@mperham
Collaborator

mperham commented Jun 13, 2024

https://github.com/contribsys/faktory/wiki/Licensing#staging

"Faktory has the notion of environment -- you start Faktory in development, staging or production and only pay for production servers."

You have a total of 200 connections in production. You can have two production servers sharing the same license, each with FAKTORY_CONN_MAX=100. You can have unlimited staging servers but each staging server is limited to 100 connections automatically when you use -e staging.
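
Roughly, that split looks like this (placeholder key shown; ports and other flags omitted):

# two production servers sharing one 200-connection license, 100 each
FAKTORY_LICENSE=<your-key> FAKTORY_CONN_MAX=100 faktory -e production

# staging servers are free and capped at 100 connections automatically
FAKTORY_LICENSE=<your-key> faktory -e staging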

@rocktavious
Author

@mperham
So then something is broken with the startup for Faktory. Here is the command I'm using:

faktory -b :7419 -w :7420 -l debug -e staging

and I get these logs:

opslevel-faktory-0 server Faktory Enterprise 1.9.0 linux/amd64
opslevel-faktory-0 server © 2024 Contributed Systems LLC.
opslevel-faktory-0 server D 2024-06-17T14:21:43.540Z Options: {:7419 :7420 staging /etc/faktory debug /var/lib/faktory/db}
opslevel-faktory-0 server W 2024-06-17T14:21:43.540Z Invalid licensing, please see the Faktory wiki for proper configuration
opslevel-faktory-0 server E 2024-06-17T14:21:43.540Z : No valid licensing found in FAKTORY_LICENSE or /etc/faktory/license: 

and then the pod shuts down.

I've also tried giving it the FAKTORY_LICENSE which was sent to us, and that's when I get the logs shown in the initial report above.

@mperham
Collaborator

mperham commented Jun 17, 2024

Sounds like you have a good clue. I have to assume there's some annoyance in your licensing configuration preventing startup, like unwanted whitespace or something.

@rocktavious
Author

@mperham I FOUND IT. It totally is the statsd settings.

I switched it back to using the FAKTORY_LICENSE, and when using -e staging it stands up as long as I remove the statsd.toml, because it's pointed at a Datadog agent that doesn't exist. So it seems like the statsd initialization is crashing the Faktory server if it cannot reach the statsd server???

@mperham
Collaborator

mperham commented Jun 17, 2024

That still doesn't make sense to me; Datadog should be more resilient than that. The statsd protocol uses UDP, which is connectionless. The existence of the remote side should be irrelevant.

@rocktavious
Author

@mperham - well, I've confirmed it. Adding back the statsd.toml causes it to shut down.

Here is my config:

[statsd]
location = "datadog-agent.datadog.svc.cluster.local:8125"
namespace = "faktory"
tags = ["env:dev"]
queueLatency = ["app", "default", "runner"]

So it seems there are a few things:

  • To use -e staging you still need to provide a valid FAKTORY_LICENSE (which I don't think is expected).
  • If statsd.toml has a location that isn't present or doesn't resolve, the Faktory application exits with exit code 0 without any logs indicating what the problem is.

I've replicated this on both 1.8.0 and 1.9.0 using the Enterprise image docker.contribsys.com/contribsys/faktory-ent.

@mperham
Collaborator

mperham commented Jun 17, 2024

Yep, I can reproduce this. Fix coming...

@mperham
Collaborator

mperham commented Jun 17, 2024

I've found and fixed the issue with Faktory stopping with no error message; it now logs the failure:

I 2024-06-17T16:15:07.010Z Sending statsd metrics to mike:8150 with namespace faktory
I 2024-06-17T16:15:07.010Z Web server now listening at localhost:7420
E 2024-06-17T16:15:07.012Z Unable to start Faktory: cannot start server subsystem statsd at mike:8150: lookup mike: no such host
D 2024-06-17T16:15:07.012Z Stopping scheduled tasks
D 2024-06-17T16:15:07.113Z Stopping storage
D 2024-06-17T16:15:07.113Z Shutting down Redis PID 5739

@mperham
Collaborator

mperham commented Jun 17, 2024

I suspect your underlying issue is that the Datadog statsd host is not DNS-resolvable. Datadog resolves the IP address when it creates the statsd connection.
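
For illustration only (this is generic Go, not Faktory's exact code): a UDP dial still performs the DNS lookup up front, so a statsd location that doesn't resolve fails at startup even though UDP itself is connectionless.

package main

import (
	"fmt"
	"net"
)

func main() {
	// The address stands in for the statsd "location" from statsd.toml; if the
	// hostname doesn't resolve, Dial returns a "no such host" error immediately,
	// even though no packets are sent to establish a UDP "connection".
	conn, err := net.Dial("udp", "datadog-agent.datadog.svc.cluster.local:8125")
	if err != nil {
		fmt.Println("cannot start statsd subsystem:", err)
		return
	}
	defer conn.Close()
	fmt.Println("statsd sink ready:", conn.RemoteAddr())
}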

mperham closed this as completed Jun 17, 2024
@rocktavious
Author

Thanks @mperham - I assume this will be in 1.10?

Also what about the -e staging still requiring the FAKTORY_LICENSE env var to be filled? Is that expected?

mperham added a commit that referenced this issue Jun 17, 2024
@mperham
Collaborator

mperham commented Jun 17, 2024

It will be in 1.9.1 or 1.10.

A license is always required outside of the development environment.
