Durable Execution: Where do deterministic constraints come from?
Is it because of the system? Or because of your use case?
During a discussion about determinism and Durable Execution, a super interesting point was made. With a log-based Durable Execution system, like Temporal or Restate, your “workflow code” needs to be deterministic to make the system happy. With an object set-based system, like Resonate, your “workflow code” needs to be deterministic to make your use case happy.
I spent several years working on Temporal’s documentation. One of the challenges was making “getting started with Temporal” content approachable while also guiding developers around the pitfall of non-determinism in their workflows.
From my perspective, this was difficult because, to craft workflows that could actually recover from crash failures, a developer inevitably needed to understand how replay worked. The catch-22 was that no one really wanted to spend time learning a deep technical concept just to adopt a new technology, and yet it was fairly necessary to address early in a developer’s journey.
All that is to say, I took some deep dives into the inner workings of Temporal’s replay algorithm to try and understand where that line might be — the balance between learning replay and getting yourself started.
I am still not sure I can confidently say there is a one-size-fits-all approach there. Distributed systems are complex, and every developer is on a different journey with a different set of preferences for the patterns they like and don’t like.
However, I did walk away with a real interest in Durable Execution systems and confidence that comes from having a solid mental model. So, it has been incredibly fun to dive deep into how Resonate works and compare it with Temporal and others.
The insight about where deterministic constraints come from (to make the system happy, or to make your use case happy) came from exploring the method by which each system makes replay possible — replay being the means by which a function can recover after the process it was executing in crashes.
Log-based systems
Temporal and Restate both maintain a dedicated, sequenced log of steps per workflow invocation. Temporal calls it an Event History; Restate calls it a Journal.
When a workflow function needs to recover, it is re-executed in a new process, and each step of the execution is compared to its log. To prevent unwanted side effects, values are pulled from the log for steps that are tracked there. This continues until the execution progresses past the last step in the log.
Example log of a completed workflow function:
Workflow: foo started {args:[]}
Step: bar invoked {args: []}
Step: bar returned {result: 1}
Step: baz invoked {args: [1]}
Step: baz returned {result: 2}
Step: qux invoked {args: [2]}
Step: qux returned {result: 3}
Workflow: foo completed {result: 3}
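To make the mechanics concrete, here is a minimal sketch of a log-based replay wrapper. This is illustrative only, not Temporal’s or Restate’s actual implementation, and it simplifies the log to one entry per completed step. Steps already recorded return their logged results without re-executing; a step that doesn’t match the log aborts the replay.
Example sketch (TypeScript):
type LogEntry = { step: string; result: unknown };

class Replayer {
  private cursor = 0;
  constructor(private log: LogEntry[]) {}

  async step<T>(name: string, fn: () => Promise<T>): Promise<T> {
    const recorded = this.log[this.cursor];
    if (recorded !== undefined) {
      // Still inside the log: the current execution must match it.
      if (recorded.step !== name) {
        throw new Error(`deterministic replay error: expected ${recorded.step}, got ${name}`);
      }
      this.cursor++;
      return recorded.result as T; // pull the value from the log, no side effect
    }
    // Past the end of the log: execute for real and record the result.
    const result = await fn();
    this.log.push({ step: name, result });
    this.cursor++;
    return result;
  }
}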
For this to work, you have to separate the workflow code from steps that introduce any randomness, such as interactions with APIs or file systems, or generating random numbers, because during a replay the steps in the execution must match the steps in the log. If there is a mismatch, the system cannot guarantee recovery.
Here’s a practical example: let’s say you add a random number generator to your workflow function, giving a 50% chance of invoking baz().
Example pseudo-code:
@workflow
func foo(ctx) {
  result = await ctx.step(bar)
  // 50% chance of invoking baz
  if (random(0, 100) > 50) {
    result = await ctx.step(baz, result)
  }
  result = await ctx.step(qux, result)
  return result
}

// activity
func bar() {
  // ...
  return 1
}

// activity
func baz(arg) {
  // ...
  return arg + 1
}

// activity
func qux(arg) {
  // ...
  return arg + 1
}
Let’s say that on the first execution attempt baz() is invoked and returns, but just after that the process crashes. The log might look like this:
Workflow: foo started {args:[]}
Step: bar invoked {args: []}
Step: bar returned {result: 1}
Step: baz invoked {args: [1]}
Step: baz returned {result: 2}
Let’s say another process picks up the task to execute the workflow function and executes it for the second time. However, on the second execution, baz() is not invoked. When the execution reaches the invocation of qux(), there is a mismatch between the current execution and the log, and now the system doesn’t know what to do.
Workflow: foo started {args:[]}
Step: bar invoked {args: []}
Step: bar returned {result: 1}
Step: qux invoked {args: [1]} !DETERMINISTIC REPLAY ERROR
It doesn’t matter whether your use case allows for non-determinism or not. The system can’t tolerate the mismatch, because without a matching log it can’t recover your workflow execution.
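One common remedy in log-based systems is to record the non-deterministic value itself as a step, so every replay sees the same draw and takes the same branch. Temporal, for example, documents a side-effect facility for capturing values like this. The sketch below is hypothetical: it reuses the Replayer from the earlier sketch rather than any real SDK API.
Example sketch (TypeScript):
async function foo(replayer: Replayer): Promise<number> {
  let result = await replayer.step("bar", async () => 1);
  // The draw itself is logged, so a replay reuses the recorded value
  // and takes the same branch as the original execution.
  const draw = await replayer.step("draw", async () => Math.random() * 100);
  if (draw > 50) {
    result = await replayer.step("baz", async () => result + 1);
  }
  return replayer.step("qux", async () => result + 1);
}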
Object set-based systems
Resonate also replays function executions to recover after a process crashes.
However, instead of maintaining a dedicated log and specifying the exact function the log applies to, Resonate uses an unsequenced set of promises.
During the replay of a function execution, values are pulled from resolved promises. When the execution reaches a promise that doesn’t yet have a result, it invokes the step (in most cases, a function) and waits for the value; reaching an unresolved promise indicates the execution has progressed further than the previous one.
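Here is the same kind of sketch for promise-based replay. Again, this is illustrative only, not Resonate’s actual API. The key difference from the log-based sketch is that each step is keyed by a promise ID rather than by its position in a sequenced log, so a skipped step leaves nothing to mismatch against.
Example sketch (TypeScript):
type DurablePromise = { status: "pending" | "resolved"; result?: unknown };

class PromiseStore {
  private promises = new Map<string, DurablePromise>();

  async step<T>(id: string, fn: () => Promise<T>): Promise<T> {
    const existing = this.promises.get(id);
    if (existing !== undefined && existing.status === "resolved") {
      // A resolved promise already exists: reuse its value, no side effect.
      return existing.result as T;
    }
    // No result yet: this execution has progressed further than any
    // previous attempt, so invoke the step for real and store the result.
    const result = await fn();
    this.promises.set(id, { status: "resolved", result });
    return result;
  }
}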
Consider the same situation as before, where there is a 50% chance of invoking baz(), and on the first execution baz() is invoked and returns.
The following promises would exist:
PromiseFoo: {Status: Pending, Result: Null}
PromiseBar: {Status: Resolved, Result: 1}
PromiseBaz: {Status: Resolved, Result: 2}
Again, assuming the process crashed after baz() returned, foo() would re-execute in a new process. The value of bar() would come from the previously completed promise to prevent duplicate side effects. However, in this case, the system would not prevent the execution from continuing if baz() is skipped.
The execution would just progress to qux() because there is no log sequence dictating that baz() should be the next step. The resulting promise states would look like this:
PromiseFoo: {Status: Resolved, Result: 2}
PromiseBar: {Status: Resolved, Result: 1}
PromiseBaz: {Status: Resolved, Result: 2} (left over from the first execution, unused this time)
PromiseQux: {Status: Resolved, Result: 2}
Using this replay method, you don’t need to separate functions into those that must be deterministic and those that can engage in randomness.
However, on the first execution baz() did execute to completion, and whether that is bad or not is up to your use case. Something undesirable could happen if you are not aware of how replay works.
Understanding replay
The theme that I see is that, to be successful with any Durable Execution platform, a developer will inevitably need to understand how replay recovers executions after process crashes. There doesn’t seem to be a way around that.
A log-based approach could almost be considered a “guard rail” approach: the system doesn’t allow progress past the point of contention, and in many cases that lets a fix be deployed to resolve it. The upfront learning curve includes both the system’s replay semantics and the primitives you need to work within the system, such as “workflows” and “events”, thus warranting the guard rails.
An object set-based approach takes the guard rails off, so it is important to understand replay and how it could affect your use case early on. But since this approach builds on existing and well-known primitives (functions and promises), the learning curve could be far less steep, leaving plenty of room for a developer to wrap their head around the concept without too much struggle.
There are many more aspects of these systems to compare — even on the topic of deterministic constraints, this barely scratches the surface. But for whatever reason, I found this distinction between the two types of systems incredibly compelling and worth thinking about.
I hope you enjoyed thinking about it too!