Difference between retry and recovery
Retry happens on the Local Event Loop, recovery happens on the Global Event Loop
In the context of Resonate there is a difference between retry and recovery.
Retry refers to the re-execution of a function when the function throws or raises an error during execution.
Recovery refers to the re-execution of a function in a new process when the process it was executing in crashes.
Due to the design of Resonate’s Task Framework, this means that retry happens to invocations scheduled on the Local Event Loop, whereas recovery happens to invocations scheduled on the Global Event Loop.
The Global Event Loop includes all processes capable of accepting invocation requests.
The Local Event Loop is within a single process.
Consider the following scenario. Function foo(), executing in process-a, makes a Remote Function Call to bar().
This schedules the invocation of bar() on the Global Event Loop.
Let’s say bar() is available in both process-b and process-c.
Let’s say process-b acknowledges the invocation request first. The invocation of bar() is then scheduled on process-b’s Local Event Loop.
If bar() throws an error, the invocation is again scheduled on the Local Event Loop, because process-b is still active and has not given up trying to execute bar(). This is a retry.
Retry happens in the context of control flow. That is, exceptions are catchable and can be handled by the component experiencing the failure.
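The retry behavior can be illustrated with a minimal sketch. This is a toy model in plain Python, not Resonate's API: the function run_local_event_loop, the max_attempts parameter, and the calls counter are all assumptions made for illustration. It shows a single process catching an exception and rescheduling the same invocation on its own local queue.

```python
from collections import deque

def run_local_event_loop(invocations, max_attempts=3):
    """Toy Local Event Loop (hypothetical): run each invocation and,
    if it raises, reschedule it on the same local queue."""
    queue = deque((fn, 1) for fn in invocations)
    results = {}
    while queue:
        fn, attempt = queue.popleft()
        try:
            results[fn.__name__] = fn()
        except Exception as err:
            # The process is still alive, so the error is catchable here.
            if attempt < max_attempts:
                queue.append((fn, attempt + 1))  # this is a retry
            else:
                results[fn.__name__] = err       # give up and record the error
    return results

calls = {"bar": 0}

def bar():
    # Fails on the first two attempts, then succeeds.
    calls["bar"] += 1
    if calls["bar"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

results = run_local_event_loop([bar])
```

The point of the sketch is that everything happens inside one process: the failure is an ordinary exception, and the decision to try again is ordinary control flow.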
If process-b crashes and disappears while trying to execute bar(), then the invocation of bar() is again scheduled on the Global Event Loop. This time process-c acknowledges the invocation request, schedules it locally, and attempts to execute. This is recovery.
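The recovery path can be sketched in the same toy style. Again, this is not Resonate's implementation: the Process and Supervisor classes, the claims record, and the dispatch and recover methods are hypothetical names chosen for illustration. The sketch assumes the supervisor keeps its own record of which process acknowledged which invocation, so that it can reschedule work after a crash without relying on the crashed process's state.

```python
from collections import deque

class Process:
    """Hypothetical worker: a name, the functions it can run, and a liveness flag."""
    def __init__(self, name, functions):
        self.name = name
        self.functions = functions  # function name -> callable
        self.local_queue = deque()  # toy Local Event Loop
        self.alive = True

class Supervisor:
    """Toy supervisor that owns the global queue and remembers who claimed what."""
    def __init__(self):
        self.global_queue = deque()  # toy Global Event Loop
        self.claims = {}             # function name -> claiming process name

    def dispatch(self, processes):
        # Hand each pending invocation to the first live process that can run it.
        for _ in range(len(self.global_queue)):
            fn_name = self.global_queue.popleft()
            for proc in processes:
                if proc.alive and fn_name in proc.functions:
                    proc.local_queue.append(fn_name)
                    self.claims[fn_name] = proc.name
                    break
            else:
                self.global_queue.append(fn_name)  # no capable process right now

    def recover(self, crashed):
        # The crashed process's local state is gone; use the supervisor's records.
        crashed.alive = False
        crashed.local_queue.clear()
        for fn_name, owner in list(self.claims.items()):
            if owner == crashed.name:
                del self.claims[fn_name]
                self.global_queue.append(fn_name)  # this is recovery

def bar():
    return "bar result"

process_b = Process("process-b", {"bar": bar})
process_c = Process("process-c", {"bar": bar})
workers = [process_b, process_c]

supervisor = Supervisor()
supervisor.global_queue.append("bar")   # foo() makes its Remote Function Call

supervisor.dispatch(workers)            # process-b acknowledges first
first_owner = supervisor.claims["bar"]

supervisor.recover(process_b)           # process-b crashes before bar() completes
supervisor.dispatch(workers)            # process-c acknowledges the rescheduled invocation
second_owner = supervisor.claims["bar"]

result = process_c.functions[process_c.local_queue.popleft()]()
```

Unlike the retry case, nothing inside process-b catches anything here. The invocation only continues because a component outside the crashed process notices the failure and puts the work back on the global queue.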
Recovery requires supervision, typically in the form of a supervisor component. If recovery is happening, it means that the control flow was interrupted. Recovery also assumes that an operator of the system has multiple processes running that are capable of executing function bar(), and that the operator is monitoring the health of the supervisor component and keeping it alive.
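One common way for a supervisor to notice that a process has disappeared is a heartbeat timeout. The sketch below assumes that mechanism purely for illustration; the source does not say how Resonate's supervisor detects crashes, and find_crashed, last_heartbeat, and the timeout value are hypothetical.

```python
def find_crashed(last_heartbeat, now, timeout):
    """Return the names of processes whose last heartbeat is older than
    `timeout` seconds (hypothetical detection rule)."""
    return sorted(
        name for name, seen in last_heartbeat.items() if now - seen > timeout
    )

# Timestamps are plain numbers (seconds) to keep the example deterministic.
last_heartbeat = {"process-b": 100.0, "process-c": 108.0}
crashed = find_crashed(last_heartbeat, now=110.0, timeout=5.0)
```

Here process-b has been silent for 10 seconds and is treated as crashed, while process-c reported 2 seconds ago and is still considered healthy. If the supervisor itself stops running, nothing performs this check, which is why the operator has to keep it alive.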
Recovery is also known as Durable Execution.