Achieving Resiliency With Queues: Building A System That Never Skips A Beat In A Billion

Braze processes billions and billions of events per day on behalf of its customers, resulting in billions of hyper-focused, personalized messages—but failing to send one of those messages has consequences. To make sure those key messages are always correct and always on time, Braze takes a strategic approach to how we leverage job queues.

What’s a Job Queue?

A typical job queue is an architectural pattern where processes submit computation jobs to a queue and other processes actually execute the jobs. This is usually a good thing—when used properly, it gives you degrees of concurrency, scalability, and redundancy that you can’t get with a traditional request–response paradigm. Many workers can be executing different jobs simultaneously in multiple processes, multiple machines, or even multiple data centers for peak concurrency. You can assign certain worker nodes to work on certain queues and send particular jobs to specific queues, allowing you to scale resources as needed. If a worker process crashes or a data center goes offline, other workers can execute the remaining jobs.

While you can certainly apply these principles and run a job-queueing system easily at a small scale, the seams start to show (and even burst) when you’re processing billions and billions of jobs. Let’s take a look at a few problems Braze has faced as we’ve grown from processing thousands, to millions, and now billions of jobs per day.

Check out the rest of this blog post at Building Braze!

Achieving Resiliency With Queues: Building A System That Never Skips A Beat In A Billion

What’s a Job Queue?

Published

Category

Tags