
Plan how to scale to 2 GraphQL Executors #5557

Closed
mattkrick opened this issue Oct 21, 2021 · 4 comments · Fixed by #5560

Comments

@mattkrick
Member

mattkrick commented Oct 21, 2021

We eventually need to scale horizontally. By that I mean our app should support multiple stateful websocket servers that communicate with multiple stateless graphql executors.

I'm not really sure of the best way to do that, so I'm learning & documenting what I read here.

Definitions for our Use Case

  • Producer: a websocket server that sends a GQL query to a Consumer (a rough sketch of that message follows this list)
  • Consumer: a graphql executor
  • Consumer group: a cluster of consumers
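
As a rough illustration only (none of these field names are settled, they're just to make the Producer/Consumer split concrete), the message a Producer hands to a Consumer might look something like this:

// hypothetical job payload: every field name here is illustrative, not decided
const exampleJob = {
  socketId: 'abc123', // which websocket connection the executor should send the result back to
  authToken: '<viewer jwt>', // viewer credentials, since the executor itself is stateless
  query: 'query Me { viewer { id } }', // the GQL document to execute
  variables: {}
}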

Evaluated Technologies

  • BullMQ - This is a job queue. It works primarily using redis pub/sub. Pub/Sub means that every consumer receives the message, but only 1 wins.
  • Kafka - This is probably the most production-ready approach. There are SaaS offerings like NATS.io, but those won't work for our air-gapped deployments. Introducing Kafka is possible, but I'd like to limit how many moving parts our app has.
  • Redis streams - These came out in 2018 with Redis v5. They are similar to Kafka. Message ordering isn't respected like it is in Kafka, but that's fine for us. It's redis, so it's all in-memory, which is great. It's new-ish, so not as popular as the others. Since we already have redis, this looks like a great answer (a rough ioredis sketch follows this list).
  • K8s + load balancer - If each consumer is a pod, then we could put a load balancer in front of each cluster & let k8s handle balancing the jobs. This would require minikube or k3s or something similar for development purposes. I think it'd also have to use HTTP instead of just a message bus.
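
For illustration, the Redis streams option could look roughly like this with ioredis (the stream/group names gqlJobs and executors are just placeholders, not anything we've settled on):

import Redis from 'ioredis'

const redis = new Redis()

// one-time setup: create the consumer group ('MKSTREAM' creates the stream if it doesn't exist)
await redis.xgroup('CREATE', 'gqlJobs', 'executors', '$', 'MKSTREAM').catch(() => {
  // a BUSYGROUP error just means the group already exists
})

// Producer side: the websocket server adds a job to the stream
const publish = (job) => redis.xadd('gqlJobs', '*', 'job', JSON.stringify(job))

// Consumer side: each GraphQL executor reads as part of the 'executors' group,
// so each entry is delivered to exactly one member of the group
const consume = async (consumerId) => {
  while (true) {
    const res = await redis.xreadgroup(
      'GROUP', 'executors', consumerId,
      'COUNT', 1, 'BLOCK', 5000,
      'STREAMS', 'gqlJobs', '>'
    )
    if (!res) continue // the block timed out with no new entries
    const [, entries] = res[0]
    for (const [id, fields] of entries) {
      console.log(consumerId, 'got job', JSON.parse(fields[1]))
      await redis.xack('gqlJobs', 'executors', id) // ack so it won't be redelivered
    }
  }
}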

Learning resources

Implementations

Kinda crazy that only 1 person has published an abstraction of this pattern!

@Dschoordsch
Contributor

How did you come to the BullMQ conclusion? Did you run tests? I haven't checked the code in detail, but you can run multiple workers, and a worker starting a job will lock it so that no other worker processes it: https://docs.bullmq.io/guide/workers/stalled-jobs
I'm not saying we should necessarily use it, I just want to know in case I misunderstood something.

@mattkrick
Member Author

mattkrick commented Oct 21, 2021

My BullMQ test was very naive:

import {Queue, Worker} from 'bullmq'

// one queue, two workers competing for its jobs
const queue = new Queue('Paint')
const worker1 = new Worker('Paint', async (job) => {
  console.log('worker 1 got job', job.data)
})
const worker2 = new Worker('Paint', async (job) => {
  console.log('worker 2 got job', job.data)
})

// add a job every second & watch which worker picks it up
setInterval(() => {
  queue.add('cars', {ts: Date.now()})
}, 1000)

It would rotate the jobs between the 2 workers, which I really liked, even when I reduced the interval to 0.
It was interesting to me that it doesn't use redis's native consumer groups (https://github.com/taskforcesh/bullmq/search?q=XGROUP), so I assume it implements its own mechanism. I should dig into it more; I still don't really know how it works. I prefer native solutions over things built in TypeScript, but you're 100% right, I haven't compared the throughput of the two to figure out if one is better than the other!

Update: I got curious. Here's what BullMQ tells redis on each job. It looks like every message gets published to every worker, but since the job ID is used with rpoplpush, only 1 wins. Kinda cool!

1634838310.105701 [0 172.18.0.1:38806] "evalsha" "7a4fed284cdf2482fa92d10dd76541ef21ef1ffa" "8" "bull:Paint:wait" "bull:Paint:paused" "bull:Paint:meta" "bull:Paint:id" "bull:Paint:delayed" "bull:Paint:priority" "bull:Paint:events" "bull:Paint:delay" "\x97\xabbull:Paint:\xa0\xa4cars\xcbBw\xca?N\xcd\x80\x00\xc0\xc0\xc0" "{\"ts\":1634838310103}" "\xde\x00\x04\xa8attempts\x00\xa5delay\x00\xa5jobId\xc0\xa7backoff\xc0"
1634838310.105885 [0 lua] "INCR" "bull:Paint:id"
1634838310.105940 [0 lua] "HMSET" "bull:Paint:3" "name" "cars" "data" "{\"ts\":1634838310103}" "opts" "{\"delay\":0,\"attempts\":0}" "timestamp" "1634838310104" "delay" "0" "priority" "0"
1634838310.106016 [0 lua] "XADD" "bull:Paint:events" "*" "event" "added" "jobId" "3" "name" "cars" "data" "{\"ts\":1634838310103}" "opts" "{\"delay\":0,\"attempts\":0}"
1634838310.106090 [0 lua] "HEXISTS" "bull:Paint:meta" "paused"
1634838310.106113 [0 lua] "LPUSH" "bull:Paint:wait" "3"
1634838310.106140 [0 lua] "XADD" "bull:Paint:events" "*" "event" "waiting" "jobId" "3"
1634838310.106190 [0 lua] "HGET" "bull:Paint:meta" "opts.maxLenEvents"
1634838310.106226 [0 lua] "XTRIM" "bull:Paint:events" "MAXLEN" "~" "10000"
1634838310.107701 [0 172.18.0.1:38812] "evalsha" "e202ec7534dbc9c310ed746f9dc763d0577b0aaf" "8" "bull:Paint:wait" "bull:Paint:active" "bull:Paint:priority" "bull:Paint:events" "bull:Paint:stalled" "bull:Paint:limiter" "bull:Paint:delayed" "bull:Paint:delay" "bull:Paint:" "56f0a0d7-6e74-48f0-bad9-4dd82337f5ca" "30000" "1634838310107" "3"
1634838310.107852 [0 lua] "SREM" "bull:Paint:stalled" "3"
1634838310.107880 [0 lua] "SET" "bull:Paint:3:lock" "56f0a0d7-6e74-48f0-bad9-4dd82337f5ca" "PX" "30000"
1634838310.107925 [0 lua] "ZREM" "bull:Paint:priority" "3"
1634838310.107944 [0 lua] "XADD" "bull:Paint:events" "*" "event" "active" "jobId" "3" "prev" "waiting"
1634838310.108018 [0 lua] "HSET" "bull:Paint:3" "processedOn" "1634838310107"
1634838310.108048 [0 lua] "HGETALL" "bull:Paint:3"
1634838310.110879 [0 172.18.0.1:38812] "evalsha" "7328b0410ffef28f6cdd2caf6ff9b89b48a83074" "8" "bull:Paint:active" "bull:Paint:completed" "bull:Paint:3" "bull:Paint:wait" "bull:Paint:priority" "bull:Paint:events" "bull:Paint:meta" "bull:Paint:stalled" "3" "1634838310110" "returnvalue" "null" "completed" "" "{\"jobId\":\"3\"}" "1" "bull:Paint:" "56f0a0d7-6e74-48f0-bad9-4dd82337f5ca" "30000" "" ""
1634838310.111051 [0 lua] "EXISTS" "bull:Paint:3"
1634838310.111073 [0 lua] "SCARD" "bull:Paint:3:dependencies"
1634838310.111094 [0 lua] "GET" "bull:Paint:3:lock"
1634838310.111112 [0 lua] "DEL" "bull:Paint:3:lock"
1634838310.111129 [0 lua] "SREM" "bull:Paint:stalled" "3"
1634838310.111153 [0 lua] "LREM" "bull:Paint:active" "-1" "3"
1634838310.111181 [0 lua] "ZADD" "bull:Paint:completed" "1634838310110" "3"
1634838310.111215 [0 lua] "HMSET" "bull:Paint:3" "returnvalue" "null" "finishedOn" "1634838310110"
1634838310.111255 [0 lua] "XADD" "bull:Paint:events" "*" "event" "completed" "jobId" "3" "returnvalue" "null"
1634838310.111314 [0 lua] "RPOPLPUSH" "bull:Paint:wait" "bull:Paint:active"
1634838310.111338 [0 lua] "HGET" "bull:Paint:meta" "opts.maxLenEvents"
1634838310.111363 [0 lua] "XTRIM" "bull:Paint:events" "MAXLEN" "~" "10000"
1634838310.112656 [0 172.18.0.1:38812] "evalsha" "e202ec7534dbc9c310ed746f9dc763d0577b0aaf" "8" "bull:Paint:wait" "bull:Paint:active" "bull:Paint:priority" "bull:Paint:events" "bull:Paint:stalled" "bull:Paint:limiter" "bull:Paint:delayed" "bull:Paint:delay" "bull:Paint:" "56f0a0d7-6e74-48f0-bad9-4dd82337f5ca" "30000" "1634838310112" ""
1634838310.112804 [0 lua] "RPOPLPUSH" "bull:Paint:wait" "bull:Paint:active"
1634838310.112855 [0 lua] "XADD" "bull:Paint:events" "*" "event" "drained"
1634838310.113793 [0 172.18.0.1:38810] "brpoplpush" "bull:Paint:wait" "bull:Paint:active" "5"
1634838310.290681 [0 172.18.0.1:38808] "brpoplpush" "bull:Paint:wait" "bull:Paint:active" "5"

Why BullMQ doesn't use streams: taskforcesh/bullmq#47 (comment)
BullMQ might have a memory leak somewhere: redis/ioredis#1285 (comment)

@mariogruizdiaz

Hey Matt!
I think the concept of a consumer group is the key to letting us have more than one instance of the GQExecutor. In fact, it applies to any replicated service/microservice that is subscribed to a topic, not only the GQ components.

NATS.io lets us implement it in a very simple way.
Let's suppose that we have:

  • Service A subscribed to topic M
  • Service B subscribed to topic Y
  • Service C subscribed to topic M

...and we want to scale them horizontally so that we can run 2 replicas of each.
To prepare our services for that, NATS.io suggests subscribing each service to its topics with a group name (I would suggest using the service name as the group name). So the 2 replicas of service A will subscribe to topic M under the group name A, and so on.

When a message arrives for topic M, NATS.io will send it to exactly one consumer per group. So NATS.io will deliver each message for topic M to just one replica of service A and just one replica of service C (applying a load-balancing strategy).

An important aspect of this approach is that messages are never delivered to consumers that would just discard them, so the service bus isn't overloaded with unnecessary message traffic.

Reference doc: https://docs.nats.io/nats-concepts/queue
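
Just to illustrate the idea, a queue-group subscription with the nats package would look roughly like this (the subject and group names are only examples, and it assumes a local NATS server):

import {connect, StringCodec} from 'nats'

const sc = StringCodec()
const nc = await connect({servers: 'localhost:4222'})

// a producer just publishes to the topic; it doesn't need to know about the queue groups
nc.publish('topicM', sc.encode('hello from the websocket server'))

// both replicas of service A subscribe to topic M with the same queue group name "A",
// so NATS delivers each message on topic M to only one of them
const sub = nc.subscribe('topicM', {queue: 'A'})
for await (const msg of sub) {
  console.log('this replica got', sc.decode(msg.data))
}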

NATS.io is just the framework I picked to explain the concept. We can keep analyzing which framework is the best fit to implement this feature.

Regarding running k8s locally, Docker for Mac and Docker for Windows provide an integrated Kubernetes cluster.

@mattkrick
Member Author

An important aspect of this approach is that messages are never delivered to consumers that would just discard them, so the service bus isn't overloaded with unnecessary message traffic.

I think that's an interesting distinction! That's one of the reasons I went with redis consumer groups: less traffic, a smaller chance of a memory leak (I trust redis more than a JS solution), and 1 less package to include in our app.
