Asynchronous job executors typically execute long-running tasks and/or use external services to execute jobs.
If an outage occurs, the whole microservice may stall or stop entirely while jobs are running. To handle this, jobs are monitored by a watcher. All job executors run a watcher unless they are disabled. This scheduled thread watches all running jobs (across all job executors) in order to identify the following scenarios:
QUEUED jobs may be considered as “stalled” if they have not moved to a RUNNING state before the timeout deadline. QUEUED jobs that have stalled can be taken by another job executor with an available slot.
RUNNING jobs may be considered as “stalled” if they haven’t updated the progress before the timeout deadline. If a running job stalls, it is moved to a TIMED_OUT state.
TIMED_OUT jobs can be taken by any other job executor microservice with an available slot. If a job is taken by another job executor, it is moved to a RUNNING state.
TIMED_OUT jobs are moved back to a RUNNING state if the original job execution progresses.
TIMED_OUT jobs may be considered as “stalled” if they still haven’t updated the progress by the second timeout deadline.
TIMED_OUT jobs call the resume() method for the same plugin.
If resume() is implemented and the job progresses, it is moved to a RUNNING state.
If resume() is not implemented or the job has not progressed, it is marked as FAILED.
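
The scenarios above can be read as a small state machine. The following is a minimal Python sketch of that reading, not the product's actual implementation. The names Job, JobState, watch and take, the single timeout value used for both deadlines, and the convention that resume() returns True when the job progresses are all assumptions made for illustration. The sketch also assumes that a stalled QUEUED job moves to RUNNING once another executor takes it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class JobState(Enum):
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    TIMED_OUT = "TIMED_OUT"
    FAILED = "FAILED"


@dataclass
class Job:
    # Hypothetical job record; field names are illustrative only.
    state: JobState
    last_update: float  # time of the last state change or progress update
    timed_out_at: Optional[float] = None  # when the job entered TIMED_OUT
    resume: Optional[Callable[[], bool]] = None  # None = plugin has no resume()


def is_stalled(job: Job, now: float, timeout: float) -> bool:
    """True if the job has not moved or reported progress within the timeout."""
    return now - job.last_update > timeout


def watch(job: Job, now: float, timeout: float) -> JobState:
    """Apply one watcher pass to a job and return its (possibly new) state."""
    if job.state is JobState.RUNNING and is_stalled(job, now, timeout):
        # First deadline missed: the running job is marked as timed out.
        job.state = JobState.TIMED_OUT
        job.timed_out_at = now
    elif job.state is JobState.TIMED_OUT:
        since = job.timed_out_at if job.timed_out_at is not None else job.last_update
        if job.last_update > since:
            # The original execution progressed after timing out.
            job.state = JobState.RUNNING
            job.timed_out_at = None
        elif now - since > timeout:
            # Second deadline missed: try resume() on the same plugin.
            resumed = job.resume is not None and job.resume()
            if resumed:
                job.state = JobState.RUNNING
                job.last_update = now
                job.timed_out_at = None
            else:
                job.state = JobState.FAILED
    return job.state


def take(job: Job, now: float, timeout: float) -> bool:
    """Another executor with a free slot claims a stalled QUEUED or TIMED_OUT job."""
    claimable = job.state is JobState.TIMED_OUT or (
        job.state is JobState.QUEUED and is_stalled(job, now, timeout)
    )
    if claimable:
        job.state = JobState.RUNNING
        job.last_update = now
        job.timed_out_at = None
    return claimable
```

In this sketch a watcher would call watch() for every job on each scheduled pass, while an executor with a free slot would call take() on candidate jobs. A real deployment would also need the claim to be atomic across executors, which is outside the scope of this example.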