In this post, we’ll explore the problem of scheduling and executing arbitrary tasks in a distributed infrastructure. Specifically, we’ll talk about how PagerDuty built an open-source library to solve this problem, and how our solution tackles a few different distributed systems problems.
Task scheduling is a problem common to many organizations. At PagerDuty, examples might include emailing an engineer two days before they go on-call, or push-notifying a responder 1 minute after an incident occurs. Expressed in very generic terms, PagerDuty engineers must be able to schedule arbitrary chunks of code (called tasks), at an arbitrary time, in the face of changing infrastructure.
The task might send an SMS or it might talk to a database; it could be synchronous or asynchronous. The task could be scheduled to run one minute from now, or one year from now. Meanwhile, the infrastructure that it was scheduled on might be entirely different than that on which it runs: next year we might use four data centers instead of three, or all of the virtual machines in a given microservice might have been replaced.
How can we ensure that our tasks run on time, in an orderly fashion, despite these changes?
The Old Solution
PagerDuty has an existing solution to this problem, called WorkQueue, and it’s been around for a few years. It works, but there was some room for improvement.
WorkQueue distributed tasks to different service instances via a partitioned queue in Apache Cassandra. The trouble was ensuring that each partition was always being worked. Different service instances were constantly peeking at the queue partitions to ensure they were advancing, as well as doing health checks on each other. If a service instance died at 3 AM, an on-call engineer would need to perform a complex sequence of steps to replace it.
WorkQueue was also quite slow. The only way that tasks were executed was by polling them from Cassandra. This frequent, slow IO actually influenced decisions on where to physically locate our DCs; they are currently all on the west coast of the US to minimize latency between our Cassandra nodes.
Lastly, WorkQueue put PagerDuty in a situation where only a few very experienced engineers understood this complex partitioning logic which supported many of our critical services. This is not a good idea from a risk management point of view (see Bus Factor), and it’s not fair to your experienced employees to always ask them to jump in and solve a certain class of problem.
The New Hotness
Considering the above problems, the Core team built a new solution to this same problem. Since it schedules things, we very originally called it “Scheduler”.
Like WorkQueue, Scheduler is written in Scala. It still uses Cassandra for task persistence, but it adds Apache Kafka to handle task queuing and partitioning, and Akka to structure the library’s concurrency.
Below is a high-level diagram of the new Scheduler and how it integrates into our services:
Some service, written in Scala, is shown in red in this diagram. It includes Scheduler as two separate libraries shown in green, one for the client and one for the implementation. The service’s logic schedules a task by passing it to a
scheduleTask method provided by the library.
The library, in turn, serializes the task metadata and enqueues it into Kafka. On the other end of the queue, Scheduler itself consumes tasks as they are sent. Scheduler always persists tasks to Cassandra to ensure they can’t be lost, but if a task is scheduled before a certain time in the future, it will remain in memory as well.
Scheduler is also fetching tasks out of Cassandra on a regular basis. The combination of in-memory tasks from Kafka and fetched tasks from Cassandra is executed, on-schedule, by calling a
TaskExecutor defined by the encompassing service. Because the task execution is defined by the encompassing service, Scheduler does not need to serialize and persist any task logic, only metadata.
Coming Up Next…
Hopefully this post has given you some idea why Scheduler exists, and some basics on how it works. In subsequent posts, we’ll discuss how Scheduler handles dynamic load, data center outages, all while keeping tasks in order! In the meantime, you can learn more about Scheduler by reading its code and documentation on GitHub.